Investigating Loss Functions and Optimization Methods for Discriminative
Learning of Label Sequences
Yasemin Altun
Computer Science
Brown University
Providence, RI 02912
altun@cs.brown.edu
Mark Johnson
Cognitive and Linguistic Sciences
Brown University
Providence, RI 02912
Mark_Johnson@brown.edu
Thomas Hofmann
Computer Science
Brown University
Providence, RI 02912
th@cs.brown.edu
Abstract
Discriminative models have been of inter-
est in the NLP community in recent years.
Previous research has shown that they
are advantageous over generative mod-
els. In this paper, we investigate how dif-
ferent objective functions and optimiza-
tion methods affect the performance of the
classifiers in the discriminative learning
framework. We focus on the sequence la-
belling problem, particularly POS tagging
and NER tasks. Our experiments show
that changing the objective function is not
as effective as changing the features in-
cluded in the model.
1 Introduction
Until recently, generative models were the most
common approach for many NLP tasks. In the last
few years, however, there has been growing interest
in discriminative models in the NLP community,
and these models have been shown to be successful
for different tasks (Lafferty et al., 2001; Ratnaparkhi,
1999; Collins, 2000). Discriminative models not
only have theoretical advantages over generative
models, as we discuss in Section 2, but have also
been shown to be empirically favorable over
generative models when features and objective
functions are fixed (Klein and Manning, 2002).
In this paper, we use discriminative models to
investigate the optimization of different objective
functions by a variety of optimization methods. We
focus on label sequence learning tasks. Part-of-
Speech (POS) tagging and Named Entity Recogni-
tion (NER) are the most studied applications among
these tasks. However, there are many others, such
as chunking, pitch accent prediction and speech edit
detection. These tasks differ in many aspects, such
as the nature of the label sequences (chunks or indi-
vidual labels), their difficulty and evaluation meth-
ods. Given this variety, we think it is worthwhile to
investigate how optimizing different objective func-
tions affects performance. In this paper, we varied
the scale (exponential vs logarithmic) and the man-
ner of the optimization (sequential vs pointwise)
and, using different combinations, designed four
different objective functions. We optimized these
functions on NER and POS tagging tasks. Contrary
to our intuitions, our experiments show that optimiz-
ing objective functions that vary in scale and manner
does not affect accuracy much. Instead, the selection
of the features has a larger impact.
The choice of the optimization method is impor-
tant for many learning problems. We would like
to use optimization methods that can handle a large
number of features, converge fast and return sparse
classifiers. The importance of features, and there-
fore of the ability to cope with a large number of
features, is well known. Since
training discriminative models over large corpora
can be expensive, an optimization method that con-
verges fast might be advantageous over others. A
sparse classifier has a shorter test time than a denser
classifier. For applications in which the test time is
crucial, optimization methods that result in sparser
classifiers might be preferable over other methods
[Figure 1: Graphical representation of HMMs (a) and CRFs (b), over observations x(t) and labels y(t). Shaded areas indicate variables that the model conditions on.]
even if their training time is longer. In this paper we
investigate these aspects for different optimization
methods, i.e. the number of features, training time
and sparseness, as well as the accuracy. In some
cases, an approximate optimization that is more ef-
ficient in one of these aspects might be preferable to
the exact method, if they have similar accuracy. We
experiment with exact versus approximate as well
as parallel versus sequential optimization methods.
For the exact methods, we use an off-the-shelf gradi-
ent based optimization routine. For the approximate
methods, we use a perceptron and a boosting algo-
rithm for sequence labelling, which update the fea-
ture weights in parallel and sequentially, respectively.
2 Discriminative Modeling of Label
Sequences Learning
Label sequence learning is, formally, the problem
of learning a function that maps a sequence of ob-
servations $\mathbf{x} = (x_1, x_2, \ldots, x_T)$ to a label sequence
$\mathbf{y} = (y_1, y_2, \ldots, y_T)$, where each $y_t \in \Sigma$, the set of
individual labels. For example, in POS tagging, the
words $x_t$ construct a sentence $\mathbf{x}$, and $\mathbf{y}$ is the la-
belling of the sentence, where $y_t$ is the part-of-speech
tag of the word $x_t$. We are interested in the super-
vised learning setting, where we are given a corpus
$\mathcal{C} = \{(\mathbf{x}^1, \mathbf{y}^1), (\mathbf{x}^2, \mathbf{y}^2), \ldots, (\mathbf{x}^N, \mathbf{y}^N)\}$ in order to learn
the classifier.
The most popular model for label sequence learn-
ing is the Hidden Markov Model (HMM). An HMM,
as a generative model, is trained by finding the joint
probability distribution $P(\mathbf{x}, \mathbf{y})$ over the observation
and label sequences that explains the corpus $\mathcal{C}$
best (Figure 1a). In this model, each random vari-
able is assumed to be independent of the other ran-
dom variables, given its parents. Because natural
languages have long-distance dependencies that se-
quences cannot capture, this conditional indepen-
dence assumption is violated in many NLP tasks.
Another shortcoming of this model is that, due to its
generative nature, overlapping features are difficult
to use in HMMs. For this reason, HMMs have been
standardly used with current word-current label, and
previous label(s)-current label features. However,
if we incorporate information about the neighboring
words and/or about more detailed characteristics of
the current word directly into our model, rather than
propagating it through the previous labels, we may
hope to learn a better classifier.
Many different models, such as Maximum En-
tropy Markov Models (MEMMs) (McCallum et al.,
2000), Projection based Markov Models (PMMs)
(Punyakanok and Roth, 2000) and Conditional Ran-
dom Fields (CRFs) (Lafferty et al., 2001), have been
proposed to overcome these problems. The common
property of these models is their discriminative ap-
proach: they model the probability distribution of
the label sequences given the observation sequences,
$P(\mathbf{y}|\mathbf{x})$.
The best performing models of label sequence
learning are MEMMs or PMMs (also known as
Maximum Entropy models) whose features are care-
fully designed for the specific tasks (Ratnaparkhi,
1999; Toutanova and Manning, 2000). However,
maximum entropy models suffer from the so called
label bias problem, the problem of making local de-
cisions (Lafferty et al., 2001). Lafferty et al. (2001)
show that CRFs overcome the label-bias problem
and outperform MEMMs in POS tagging.
CRFs define a probability distribution over the
whole sequence $\mathbf{y}$, globally conditioning on the
whole observation sequence $\mathbf{x}$ (Figure 1b). Be-
cause they condition on the observation (as opposed
to generating it), they can use overlapping features.
The features $f(\mathbf{x}, \mathbf{y}, t)$ used in this paper are of the
form:

1. Current label and information about the obser-
vation sequence, such as the identity or spelling
features of a word that is within a window
of the word currently labelled. Each of these
features corresponds to a choice of $y_t$ and $x_j$,
where $j \in \{t-\delta, \ldots, t, \ldots, t+\delta\}$ and $\delta$ is the
half window size.

2. Current label and the neighbors of that label,
i.e. features that capture the inter-label depen-
dencies. Each of these features corresponds to
a choice of $y_t$ and the neighbors of $y_t$, e.g. in a
bigram model, $f(y_{t-1}, y_t)$.
The conditional probability distribution defined
by this model is:

$$P_\Lambda(\mathbf{y}|\mathbf{x}) = \frac{\exp\left(\sum_t \sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y}, t)\right)}{Z_\Lambda(\mathbf{x})}$$

where the $\lambda_k$'s are the parameters to be estimated from
the training corpus $\mathcal{C}$ and $Z_\Lambda(\mathbf{x})$ is a normaliza-
tion term that assures a proper probability distribu-
tion. In order to simplify the notation, we in-
troduce $S_k(\mathbf{x}, \mathbf{y}) = \sum_t f_k(\mathbf{x}, \mathbf{y}, t)$, which is the
number of times feature $f_k$ is observed in the $(\mathbf{x}, \mathbf{y})$
pair, and $F_\Lambda(\mathbf{x}, \mathbf{y}) = \sum_k \lambda_k S_k(\mathbf{x}, \mathbf{y})$, which is
the linear combination of all the features with
the parameters $\Lambda$. $S_k$ is the sufficient statis-
tic of $\lambda_k$. Then, we can rewrite $P_\Lambda(\mathbf{y}|\mathbf{x})$ as:

$$P_\Lambda(\mathbf{y}|\mathbf{x}) = \frac{\exp F_\Lambda(\mathbf{x}, \mathbf{y})}{\sum_{\mathbf{y}'} \exp F_\Lambda(\mathbf{x}, \mathbf{y}')}$$
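To make the definitions above concrete, here is a minimal sketch that scores label sequences with a hand-set parameter vector and computes $P_\Lambda(\mathbf{y}|\mathbf{x})$ by brute-force enumeration of all label sequences. The feature templates, weights, and words are our own toy choices, not from the paper:

```python
import itertools
import math

# Toy label set; F(x, y) = sum_k lambda_k * S_k(x, y), with S_k counting
# feature occurrences over positions t (a sketch, not the paper's code).
LABELS = ["A", "B"]

def features(x, y, t):
    """Binary features at position t: (word, label) and (previous label, label)."""
    feats = [("obs", x[t], y[t])]
    if t > 0:
        feats.append(("trans", y[t - 1], y[t]))
    return feats

def score(lam, x, y):
    """F_Lambda(x, y): linear combination of feature counts."""
    return sum(lam.get(f, 0.0) for t in range(len(x)) for f in features(x, y, t))

def conditional_prob(lam, x, y):
    """P_Lambda(y|x) = exp F(x,y) / sum_y' exp F(x,y'), by enumeration."""
    z = sum(math.exp(score(lam, x, list(yp)))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(lam, x, y)) / z

# Hypothetical weights: "the" prefers tag A, and transition A->B is rewarded.
lam = {("obs", "the", "A"): 1.0, ("trans", "A", "B"): 0.5}
x = ["the", "dog"]
probs = {yp: conditional_prob(lam, x, list(yp))
         for yp in itertools.product(LABELS, repeat=len(x))}
assert abs(sum(probs.values()) - 1.0) < 1e-9  # a proper distribution
```

Enumeration is exponential in sequence length; the paper's models instead rely on the forward-backward and Viterbi algorithms, which the toy above only stands in for.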
3 Loss Functions for Label Sequences
Given the theoretical advantages of discriminative
models over generative models, the empirical sup-
port of (Klein and Manning, 2002), and the fact that
CRFs are the state of the art among discriminative
models for label sequences, we chose CRFs as our
model, and trained them by optimizing various ob-
jective functions $R_\Lambda(\mathcal{C})$ with respect to the corpus $\mathcal{C}$.

The applications of these models to label
sequence problems vary widely. The individual
labels might constitute chunks (e.g. Named-Entity
Recognition, shallow parsing), or they may be
single entries (e.g. POS tagging). The difficulty,
and therefore the accuracy, of the tasks are very
different from each other. The evaluation of the
systems differs from one task to another, and the
nature of the statistical noise level is task and corpus
dependent. Given this variety, using objective
functions tailored for each task might result in better
classifiers. We consider two dimensions in designing
objective functions: exponential versus logarithmic
loss functions, and sequential versus pointwise
optimization functions.
3.1 Exponential vs Logarithmic Loss functions
Most estimation procedures in NLP proceed by
maximizing the likelihood of the training data.

[Figure 2: Loss values of the 0-1, exp, and log loss functions in a binary classification problem, plotted against $P_\theta(y^i|x^i)$.]

To
overcome the numerical problems of working with
a product of a large number of small probabilities,
usually the logarithm of the likelihood of the data
is optimized. However, most of the time, these sys-
tems, sequence labelling systems in particular, are
tested with respect to their error rate on test data, i.e.
the fraction of times the function $F_\Lambda$ assigns a higher
score to a label sequence $\mathbf{y}$ (with $\mathbf{y} \neq \mathbf{y}^i$) than
to the correct label sequence $\mathbf{y}^i$ for an observation
$\mathbf{x}^i$ in the test data. Then, the rank loss of $F_\Lambda$ might be a
more natural objective to minimize:
$$R^{rnk}(\Lambda; \mathcal{C}) = \sum_i \sum_{\mathbf{y} \neq \mathbf{y}^i} \mathbf{1}\left[F_\Lambda(\mathbf{x}^i, \mathbf{y}) \geq F_\Lambda(\mathbf{x}^i, \mathbf{y}^i)\right]$$

$R^{rnk}$ is the total number of label sequences that $F_\Lambda$
ranks higher than the correct label sequences for the
training instances in the corpus $\mathcal{C}$. Since optimizing
the rank loss is NP-complete, one can optimize an
upper bound instead, e.g. an exponential loss func-
tion:

$$R^{exp}(\Lambda; \mathcal{C}) = \sum_i \sum_{\mathbf{y} \neq \mathbf{y}^i} \exp\left[F_\Lambda(\mathbf{x}^i, \mathbf{y}) - F_\Lambda(\mathbf{x}^i, \mathbf{y}^i)\right]$$
The exponential loss function is well studied in
the Machine Learning domain. The advantage of
the exp-loss over the log-loss is its property of pe-
nalizing incorrect labellings very severely, whereas
it penalizes almost nothing when the label sequence
is correct. This is a very desirable property for a
classifier. Figure 2 shows this property of exp-loss
in contrast to log-loss in a binary classification prob-
lem. However, this property also means that exp-loss
is sensitive to noisy data: systems optimizing exp-
loss spend more effort on the outliers and tend to be
vulnerable to noise, especially label noise.
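The penalization behaviour shown in Figure 2 can be checked numerically. The sketch below compares the log-loss $-\log p$ with the exp-loss written in its $1/p - 1$ form, as a function of the probability $p$ assigned to the correct label (a toy illustration; the probe values are our own):

```python
import math

def log_loss(p):
    """-log P(correct label)."""
    return -math.log(p)

def exp_loss(p):
    """1/P(correct label) - 1: the form the sequential exp-loss takes."""
    return 1.0 / p - 1.0

# Near-correct predictions: both losses are small and comparable.
# Confident mistakes (small p): exp-loss blows up much faster than log-loss.
for p in (0.9, 0.5, 0.1, 0.01):
    print(p, round(log_loss(p), 3), round(exp_loss(p), 3))
```

At $p = 0.01$ the exp-loss is roughly twenty times the log-loss, which is exactly the outlier-sensitivity discussed above.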
3.2 Sequential vs Pointwise Loss functions
In many applications it is very difficult to get the
whole label sequence correct: classifiers are rarely
perfect, and as sequences get longer, the probability
of predicting every label in the sequence correctly
decreases exponentially. For
this reason performance is usually measured point-
wise, i.e. in terms of the number of individual la-
bels that are correctly predicted. Most common op-
timization functions in the literature, however, treat
the whole label sequence as one label, penalizing
a label sequence that has one error and a label se-
quence that is all wrong in the same manner. We
may be able to develop better classifiers by using
a loss function more similar to the evaluation func-
tion. One possible way of accomplishing this may
be minimizing pointwise loss functions. Sequential
optimizations optimize the joint conditional proba-
bility distribution $P_\Lambda(\mathbf{y}|\mathbf{x})$, whereas the pointwise op-
timizations that we propose optimize the marginal
conditional probability distribution, $P_\Lambda(y_t|\mathbf{x}^i) =
\sum_{\mathbf{y}: \mathbf{y}(t) = y_t} P_\Lambda(\mathbf{y}|\mathbf{x}^i)$.
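A brute-force sketch of the pointwise marginal: sum the joint conditional distribution over all label sequences whose $t$-th label matches. The scorer and weights below are hypothetical toy values, not the paper's features:

```python
import itertools
import math

LABELS = ["A", "B"]

def score(lam, x, y):
    """A hypothetical bigram scorer F(x, y) for illustration."""
    s = 0.0
    for t, (w, l) in enumerate(zip(x, y)):
        s += lam.get(("obs", w, l), 0.0)
        if t > 0:
            s += lam.get(("trans", y[t - 1], l), 0.0)
    return s

def joint(lam, x):
    """P(y|x) over all label sequences, by enumeration."""
    seqs = list(itertools.product(LABELS, repeat=len(x)))
    w = [math.exp(score(lam, x, y)) for y in seqs]
    z = sum(w)
    return {y: wi / z for y, wi in zip(seqs, w)}

def marginal(lam, x, t, label):
    """P(y_t = label | x): sum over sequences whose t-th label matches."""
    return sum(p for y, p in joint(lam, x).items() if y[t] == label)

lam = {("obs", "dog", "B"): 2.0}  # a single toy weight
x = ["the", "dog"]
m = marginal(lam, x, 1, "B")
assert abs(marginal(lam, x, 1, "A") + m - 1.0) < 1e-9  # marginals sum to 1
```

In practice this sum is computed with the forward-backward algorithm rather than by enumeration.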
3.3 Four Loss functions
We derive four loss functions by taking the cross
product of the two dimensions discussed above:
- Sequential log-loss function: This function,
based on the standard maximum likelihood op-
timization, is used with CRFs in (Lafferty et al.,
2001).

$$R_1(\Lambda; \mathcal{C}) = -\sum_i \log P_\Lambda(\mathbf{y}^i|\mathbf{x}^i) = -\sum_i \left[F_\Lambda(\mathbf{x}^i, \mathbf{y}^i) - \log Z_\Lambda(\mathbf{x}^i)\right] \quad (1)$$
- Sequential exp-loss function: This loss func-
tion was first introduced in (Collins, 2000) for
NLP tasks with a structured output domain.
However, there, the sum is not over the whole
set of possible label sequences, but over the $n$
best label sequences generated by an external
mechanism. Here we include all possible la-
bel sequences, so we do not require an external
mechanism to identify the best $n$ sequences.
As shown in (Altun et al., 2002), it is possible
to sum over all label sequences by using a dy-
namic programming algorithm.

$$R_2(\Lambda; \mathcal{C}) = \sum_i \sum_{\mathbf{y} \neq \mathbf{y}^i} \exp\left[F_\Lambda(\mathbf{x}^i, \mathbf{y}) - F_\Lambda(\mathbf{x}^i, \mathbf{y}^i)\right] = \sum_i \left[P_\Lambda(\mathbf{y}^i|\mathbf{x}^i)^{-1} - 1\right] \quad (2)$$

Note that the exponential loss function is just
the inverse conditional probability minus a con-
stant.
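The identity between the sequential exp-loss and the inverse conditional probability can be verified on arbitrary scores. This sketch treats each of a handful of label sequences as having a random score $F$ (the seed and sizes are our own choices):

```python
import math
import random

random.seed(0)
# Random scores F(x, y) over a small set of label sequences; index `correct`
# plays the role of the correct sequence y^i.
scores = [random.uniform(-2.0, 2.0) for _ in range(8)]
correct = 3

# Conditional probability of the correct sequence under the softmax of scores.
z = sum(math.exp(f) for f in scores)
p_correct = math.exp(scores[correct]) / z

# Sequential exp-loss for this one instance: sum over incorrect sequences.
exp_loss = sum(math.exp(f - scores[correct])
               for i, f in enumerate(scores) if i != correct)

# The exp-loss equals the inverse conditional probability minus one.
assert abs(exp_loss - (1.0 / p_correct - 1.0)) < 1e-9
```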
- Pointwise log-loss function: This function op-
timizes the marginal probability of the labels at
each position, conditioning on the observation
sequence:

$$R_3(\Lambda; \mathcal{C}) = -\sum_i \sum_t \log P_\Lambda(y^i_t|\mathbf{x}^i) = -\sum_i \sum_t \log \sum_{\mathbf{y}: \mathbf{y}(t) = y^i_t} P_\Lambda(\mathbf{y}|\mathbf{x}^i)$$

Obviously, this function reduces to the sequen-
tial log-loss if the length of the sequence is 1.
- Pointwise exp-loss function: Following the
parallelism in the log-loss vs exp-loss functions of
sequential optimization (log vs inverse condi-
tional probability), we propose minimizing the
pointwise exp-loss function below, which re-
duces to the standard multi-class exponential
loss when the length of the sequence is 1.

$$R_4(\Lambda; \mathcal{C}) = \sum_i \sum_t \sum_{\mathbf{y}} \exp\left[F_\Lambda(\mathbf{x}^i, \mathbf{y}) - \log \sum_{\mathbf{y}': \mathbf{y}'(t) = y^i_t} \exp F_\Lambda(\mathbf{x}^i, \mathbf{y}')\right] = \sum_i \sum_t P_\Lambda(y^i_t|\mathbf{x}^i)^{-1}$$
4 Comparison of the Four Loss Functions
We now compare the performance of the four loss
functions described above. Although (Lafferty et
al., 2001) proposes a modification of the iterative
scaling algorithm for parameter estimation in se-
quential log-loss optimization, gradient-based
methods have often been found to be more efficient
for minimizing the convex loss function in Eq. (1)
(Minka, 2001). For this reason, we use a gradient-
based method to optimize the above loss functions.
4.1 Gradient Based Optimization
The gradients of the four loss functions can be com-
puted as follows:
- Sequential log-loss function:

$$\nabla_\Lambda R_1 = \sum_i \left( \mathbb{E}\left[S(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}^i\right] - S(\mathbf{x}^i, \mathbf{y}^i) \right) \quad (3)$$

where expectations are taken w.r.t. $P_\Lambda(\mathbf{y}|\mathbf{x})$.
Thus at the optimum the empirical and ex-
pected values of the sufficient statistics are
equal. The loss function and the derivatives
can be calculated with one pass of the forward-
backward algorithm.
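The "expected minus empirical sufficient statistics" form of the gradient can be sanity-checked against a finite-difference approximation of the loss on a toy model. The feature templates and weights below are our own illustration:

```python
import itertools
import math

LABELS = [0, 1]

def feat_counts(x, y):
    """Sufficient statistics S_k(x, y): counts of (word, label) features."""
    counts = {}
    for w, l in zip(x, y):
        counts[(w, l)] = counts.get((w, l), 0) + 1
    return counts

def score(lam, x, y):
    return sum(lam.get(f, 0.0) * c for f, c in feat_counts(x, y).items())

def neg_log_prob(lam, x, y):
    """-log P_Lambda(y|x) = log Z - F(x, y), by enumeration."""
    z = sum(math.exp(score(lam, x, yp))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return math.log(z) - score(lam, x, y)

def grad(lam, x, y, k):
    """Analytic gradient for feature k: E[S_k | x] - S_k(x, y)."""
    seqs = list(itertools.product(LABELS, repeat=len(x)))
    ws = [math.exp(score(lam, x, yp)) for yp in seqs]
    z = sum(ws)
    expected = sum(w / z * feat_counts(x, yp).get(k, 0)
                   for yp, w in zip(seqs, ws))
    return expected - feat_counts(x, y).get(k, 0)

lam = {("a", 0): 0.3, ("b", 1): -0.2}
x, y = ("a", "b"), (0, 1)
k = ("a", 0)
eps = 1e-6
lam_hi = dict(lam); lam_hi[k] = lam[k] + eps
lam_lo = dict(lam); lam_lo[k] = lam[k] - eps
fd = (neg_log_prob(lam_hi, x, y) - neg_log_prob(lam_lo, x, y)) / (2 * eps)
assert abs(grad(lam, x, y, k) - fd) < 1e-5  # analytic matches finite difference
```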
- Sequential exp-loss function:

$$\nabla_\Lambda R_2 = \sum_i \left( \mathbb{E}\left[S(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}^i\right] - S(\mathbf{x}^i, \mathbf{y}^i) \right) P_\Lambda(\mathbf{y}^i|\mathbf{x}^i)^{-1} \quad (4)$$
At the optimum the empirical values of the suf-
ficient statistics equals their conditional expec-
tations where the contribution of each instance
is weighted by the inverse conditional proba-
bility of the instance. Thus this loss function
focuses on the examples that have a lower con-
ditional probability, which are usually the ex-
amples that the model labels incorrectly. The
computational complexity is the same as the
log-loss case.
- Pointwise log-loss function:

$$\nabla_\Lambda R_3 = \sum_i \sum_t \left( \mathbb{E}\left[S(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}^i\right] - \mathbb{E}\left[S(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}^i, y^i_t\right] \right)$$

At the optimum, the expected values of the suf-
ficient statistics conditioned on the observation
$\mathbf{x}^i$ are equal to their expected values when also
conditioned on the correct label $y^i_t$.
The computations can be done using the dy-
namic programming described in (Kakade et
al., 2002), with the computational complexity
of the forward-backward algorithm scaled by a
constant.
- Pointwise exp-loss function:

$$\nabla_\Lambda R_4 = \sum_i \sum_t \left( \mathbb{E}\left[S(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}^i\right] - \mathbb{E}\left[S(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}^i, y^i_t\right] \right) P_\Lambda(y^i_t|\mathbf{x}^i)^{-1}$$

At the optimum, the expected values of the suf-
ficient statistics conditioned on $\mathbf{x}^i$ are equal to
their values when also conditioned on $y^i_t$, where
each point is weighted by $P_\Lambda(y^i_t|\mathbf{x}^i)^{-1}$. Com-
putational complexity is the same as the log-
loss case.
4.2 Experimental Setup
Before presenting the experimental results of the
comparison of the four loss functions described
above, we describe our experimental setup. We ran
experiments on Part-of-Speech (POS) tagging and
Named-Entity-Recognition (NER) tasks.
For POS tagging, we used the Penn TreeBank cor-
pus. There are 47 individual labels in this corpus.
Following the convention in POS tagging, we used
a Tag Dictionary for frequent words. We used Sec-
tions 1-21 for training and Section 22 for testing.
For NER, we used a Spanish corpus which was
provided for the Special Session of CoNLL2002 on
NER. There are training and test data sets and the
training data consists of about 7200 sentences. The
individual label set in the corpus consists of 9 la-
bels: the beginning and continuation of Person, Or-
ganization, Location and Miscellaneous names, and
non-name tags.
We used three different feature sets:
- $F_1$ is the set of bigram features, i.e. the current
tag and the current word, and the current tag and
the previous tags.

- $F_2$ consists of the $F_1$ features and spelling fea-
tures of the current word (e.g. "Is the current
word capitalized and the current tag Person-
Beginning?"). Some of the spelling features,
which are mostly adapted from (Bikel et al.,
1999), are the last one, two and three letters of
the word; whether the first letter is lower case,
upper case or alphanumeric; whether the word
is capitalized and contains a dot; whether all the
letters are capitalized; whether the word con-
tains a hyphen.

- $F_3$ includes the $F_2$ features not only for the current
word but also for the words within a fixed win-
dow of size $\delta$. $F_2$ is an instance of $F_3$ where
$\delta = 1$. An example of an $F_3$ feature
is "Does the previous word end with a dot and
is the current tag Organization-Intermediate?".
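The spelling tests listed for $F_2$ can be sketched as a small feature extractor; the dictionary keys below are our own names, not identifiers from the paper:

```python
def spelling_features(word):
    """Spelling features of the kind used in the F2 set (a sketch)."""
    return {
        "suffix1": word[-1:],            # last one letter
        "suffix2": word[-2:],            # last two letters
        "suffix3": word[-3:],            # last three letters
        "init_lower": word[:1].islower(),
        "init_upper": word[:1].isupper(),
        "init_alnum": word[:1].isalnum(),
        "cap_with_dot": word[:1].isupper() and "." in word,
        "all_caps": word.isupper(),
        "has_hyphen": "-" in word,
    }

f = spelling_features("Sr.")
assert f["cap_with_dot"] and not f["all_caps"]
```

In a CRF each of these boolean or string-valued tests would be conjoined with a choice of current label to yield a binary feature $f_k$.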
POS  | $R_1$ | $R_2$ | $R_3$ | $R_4$
$F_1$ | 94.91 | 94.57 | 94.90 | 94.66
$F_2$ | 95.68 | 95.25 | 95.71 | 95.31

Table 1: Accuracy of POS tagging on Penn Tree-
Bank.
For NER, we used a window of size 3 (i.e. consid-
ered features for the previous and next words). Since
the Penn TreeBank is very large, including the $F_3$ fea-
tures, i.e. incorporating the information in the neigh-
boring words directly into the model, is intractable.
Therefore, we limited our experiments to the $F_1$ and $F_2$
features for POS tagging.
4.3 Experimental Results
As a gradient based optimization method, we used
an off-the-shelf optimization tool that uses the
limited-memory updating method. We observed that
this method is faster to converge than the conju-
gate gradient descent method. It is well known that
optimizing log-loss functions may result in over-
fitting, especially with noisy data. For this rea-
son, we used a regularization term in our cost func-
tions. We experimented with different regularization
terms. As expected, we observed that the regular-
ization term increases the accuracy, especially when
the training data is small; but we did not observe
much difference when we used different regulariza-
tion terms. The results we report are with the Gaus-
sian prior regularization term described in (Johnson
et al., 1999).
Our goal in this paper is not to build the best tag-
ger or recognizer, but to compare different loss func-
tions and optimization methods. Since we did not
spend much effort on designing the most useful fea-
tures, our results are slightly worse than, but compa-
rable to the best performing models.
We extracted corpora of different sizes (ranging
from 300 sentences to the complete corpus) and ran
experiments optimizing the four loss functions us-
ing different feature sets. In Table 1 and Table 2,
we report the accuracy of predicting every individ-
ual label. It can be seen that the test accuracy ob-
tained by different loss functions lies within a rela-
tively small range, and the best performance depends
on what kind of features are included in the model.
NER  | $R_1$ | $R_2$ | $R_3$ | $R_4$
$F_1$ | 59.92 | 59.68 | 56.73 | 58.26
$F_2$ | 69.75 | 67.30 | 68.28 | 69.51
$F_3$ | 73.62 | 72.11 | 73.17 | 73.82

Table 2: F1 measure of NER on Spanish newswire
corpus. The window size is 3 for $F_3$.
We observed similar behavior when the training set
is smaller. The accuracy is highest when more fea-
tures are included in the model. From these results
we conclude that, when the model is the same, opti-
mizing different loss functions does not have much
effect on the accuracy, whereas increasing the variety
of the features included in the model has more impact.
5 Optimization methods
In Section 4, we showed that optimizing differ-
ent loss functions does not have a large impact on
the accuracy. In this section, we investigate differ-
ent methods of optimization. The gradient-based
method used in Section 4 is an exact method. If
the training corpus is large, training may take
a long time, especially when the number of features
is very large. In this method, the optimization is
done in a parallel fashion by updating all of the pa-
rameters at the same time. Therefore, the resulting
classifier uses all the features that are included in the
model and lacks sparseness.
We now consider two approximation methods to
optimize two of the loss functions described above.
We first present a perceptron algorithm for labelling
sequences. This algorithm performs parallel opti-
mization and is an approximation of the sequential
log-loss optimization. Then, we present a boosting
algorithm for label sequence learning. This algo-
rithm performs sequential optimization by updating
one parameter at a time. It optimizes the sequential
exp-loss function. We compare these methods with
the exact method using the experimental setup pre-
sented in Section 4.2.
5.1 Perceptron Algorithm for Label Sequences
Calculating the gradients, i.e. the expectations of
features for every instance in the training corpus
can be computationally expensive if the corpus is
very large. In many cases, a single training instance
might be as informative as all of the corpus to update
the parameters. Then, an online algorithm which
makes updates by using one training example may
converge much faster than a batch algorithm. If the
distribution is peaked, one label is more likely than
others and the contribution of this label dominates
the expectation values. If we assume this is the case,
i.e. we make a Viterbi assumption, we can calculate
a good approximation of the gradients by consider-
ing only the most likely, i.e. the best label sequence
according to the current model. The following on-
line perceptron algorithm (Algorithm 1), presented
in (Collins, 2002), uses these two approximations:
Algorithm 1 Label sequence perceptron algorithm.
1: initialize $\lambda_k = 0$ for all $k$
2: repeat
3:   for all training patterns $\mathbf{x}^i$ do
4:     compute $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} F_\Lambda(\mathbf{x}^i, \mathbf{y})$
5:     if $\mathbf{y}^i \neq \hat{\mathbf{y}}$ then
6:       $\lambda_k \leftarrow \lambda_k + S_k(\mathbf{x}^i, \mathbf{y}^i) - S_k(\mathbf{x}^i, \hat{\mathbf{y}})$
7:     end if
8:   end for
9: until stopping criteria
At each iteration, the perceptron algorithm calcu-
lates an approximation of the gradient of the sequen-
tial log-loss function (Eq. 3) based on the current
training instance. The batch version of this algo-
rithm is a closer approximation of the optimization
of sequential log-loss, since the only approximation
is the Viterbi assumption. The stopping criteria may
be convergence, or a fixed number of iterations over
the training data.
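Algorithm 1 can be sketched on a toy corpus as follows; enumeration stands in for the Viterbi search, and the tags, words, and feature templates are our own illustrative choices:

```python
import itertools

LABELS = ["O", "N"]  # toy tags: other / name (hypothetical)

def feats(x, y):
    """Sufficient statistics: (word, tag) and (previous tag, tag) counts."""
    c = {}
    for t, (w, l) in enumerate(zip(x, y)):
        c[("obs", w, l)] = c.get(("obs", w, l), 0) + 1
        if t > 0:
            c[("trans", y[t - 1], l)] = c.get(("trans", y[t - 1], l), 0) + 1
    return c

def score(lam, x, y):
    return sum(lam.get(f, 0.0) * v for f, v in feats(x, y).items())

def best(lam, x):
    """argmax_y F(x, y); enumeration stands in for Viterbi at this scale."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda y: score(lam, x, y))

def perceptron(corpus, epochs=5):
    lam = {}
    for _ in range(epochs):
        for x, y in corpus:
            y_hat = best(lam, x)
            if y_hat != y:  # additive update toward truth, away from prediction
                for f, v in feats(x, y).items():
                    lam[f] = lam.get(f, 0.0) + v
                for f, v in feats(x, y_hat).items():
                    lam[f] = lam.get(f, 0.0) - v
    return lam

corpus = [(("john", "runs"), ("N", "O")), (("mary", "runs"), ("N", "O"))]
lam = perceptron(corpus)
assert all(best(lam, x) == y for x, y in corpus)
```

Averaging the weight vectors across updates, as the experiments in Section 6 do, stabilizes this basic version.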
5.2 Boosting Algorithm for Label Sequences
The original boosting algorithm (AdaBoost), pre-
sented in (Schapire and Singer, 1999), is a sequen-
tial learning algorithm to induce classifiers for sin-
gle random variables. (Altun et al., 2002) presents a
boosting algorithm for learning classifiers to predict
label sequences. This algorithm minimizes an upper
bound on the sequential exp-loss function (Eq. 2).
As in AdaBoost, a distribution over observations is
defined:

$$D(i) = \frac{P_\Lambda(\mathbf{y}^i|\mathbf{x}^i)^{-1} - 1}{\sum_j \left[P_\Lambda(\mathbf{y}^j|\mathbf{x}^j)^{-1} - 1\right]} \quad (5)$$
NER  | $R_1$ | Perceptron | $R_2$ | Boosting
$F_1$ | 59.92 | 59.77 | 59.68 | 48.23
$F_2$ | 69.75 | 69.29 | 67.30 | 66.11
$F_3$ | 73.62 | 72.97 | 72.11 | 71.07

Table 3: F1 of different methods for NER
This distribution, which expresses the importance of
every training instance, is updated at each round, and
the algorithm focuses on the more difficult exam-
ples. The sequence boosting algorithm (Algorithm
2) optimizes an upper bound on the sequential exp-
loss function by using the convexity of the exponen-
tial function. $S_k^{max}$ is the maximum difference of the
sufficient statistic $S_k$ in any label sequence and the
correct label sequence of any observation $\mathbf{x}^i$; $S_k^{min}$
has a similar meaning, and $\Delta_k = S_k^{max} - S_k^{min}$.
Algorithm 2 Label sequence boosting algorithm.
1: initialize the distribution $D(i)$ as in Eq. (5)
2: repeat
3:   for all features $f_k$ do
4:     compute $r_k$, the expectation under $D$ of the difference in $S_k$
5:     compute the update $\Delta\lambda_k$ from $r_k$ and the bounds $S_k^{max}$, $S_k^{min}$
6:   end for
7:   choose the feature $f_k$ whose update most decreases the loss bound
8:   $\lambda_k \leftarrow \lambda_k + \Delta\lambda_k$
9:   update $D(i)$
10: until stopping criteria
As can be seen from Line 4 in Algorithm 2, the
feature added to the ensemble at each round
is determined by a function of the gradient of the se-
quential exp-loss function (Eq. 4). At each round,
one pass of the forward-backward algorithm over the
training data is sufficient to calculate $r_k$ for all $k$.
Considering the sparseness of the features in each
training instance, one can restrict the forward back-
ward pass only to the training instances that contain
the feature that is added to the ensemble in the last
round. The stopping criteria may be a fixed number
of rounds, or by cross-validation on a heldout cor-
pus.
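The sequential, one-parameter-at-a-time flavour of this method can be illustrated with a greedy coordinate search that directly evaluates the exp-loss on a toy multiclass problem. This brute-forces the loss instead of using the bound-based update of Algorithm 2, and all names, grids, and data are our own:

```python
import math

CLASSES = [0, 1]
# Each feature fires for (attribute present in x, class): a hypothetical setup.
FEATURES = [("red", 0), ("red", 1), ("blue", 0), ("blue", 1)]
DATA = [({"red"}, 0), ({"red"}, 0), ({"blue"}, 1)]

def score(lam, x, y):
    return sum(l for (attr, cls), l in zip(FEATURES, lam)
               if attr in x and cls == y)

def exp_loss(lam):
    """sum_i (1 / P(y_i | x_i) - 1), the multiclass analogue of Eq. (2)."""
    total = 0.0
    for x, y in DATA:
        z = sum(math.exp(score(lam, x, c)) for c in CLASSES)
        total += z / math.exp(score(lam, x, y)) - 1.0
    return total

lam = [0.0] * len(FEATURES)
grid = [-0.5, -0.25, 0.25, 0.5]
for _ in range(20):  # each round updates exactly one coordinate, greedily
    k, d = min(((k, d) for k in range(len(FEATURES)) for d in grid),
               key=lambda kd: exp_loss(
                   [l + kd[1] * (i == kd[0]) for i, l in enumerate(lam)]))
    new = [l + d * (i == k) for i, l in enumerate(lam)]
    if exp_loss(new) >= exp_loss(lam):
        break  # no single-coordinate step improves the loss
    lam = new

assert exp_loss(lam) < exp_loss([0.0] * len(FEATURES))
```

The real algorithm avoids this brute-force inner loop: the bound lets it compute all candidate updates from one forward-backward pass.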
6 Experimental Results
The results summarized in Table 3 compare the per-
ceptron and the boosting algorithm with the gradi-
ent based method. The performance of the standard
perceptron algorithm fluctuates a lot, whereas the
average perceptron is more stable. We report the results
of the average perceptron here. Not surprisingly, it
does slightly worse than CRF, since it is an approx-
imation of CRFs. The advantage of the Perceptron
algorithm is its dual formulation. In the dual form,
explicit feature mapping can be avoided by using the
kernel trick and one can have a large number of fea-
tures efficiently. As we have seen in the previous
sections, the ability to incorporate more features has
a big impact on the accuracy. Therefore, a dual per-
ceptron algorithm may have a large advantage over
other methods.
When only HMM features are used, Boosting as
a sequential algorithm performs worse than the gra-
dient based method that optimizes in a parallel fash-
ion. This is because there is not much information
in the HMM features other than the identity of the
word to be labeled. Therefore, the boosting algo-
rithm needs to include almost all the features one by
one in the ensemble. When there are just a few more
informative features, the boosting algorithm makes
better use of them. This situation is more dramatic
in POS tagging. Boosting gets 89.42% and 94.92%
accuracy for the $F_1$ and $F_2$ features, whereas the gra-
dient based method gets 94.57% and 95.25%. The
gradient based method uses all of the available fea-
tures, whereas boosting uses only about 10% of the
features. Due to the loose upper bound that boosting
optimizes, the estimates of the updates are very con-
servative. Therefore, the same features are selected
many times. This negatively affects the convergence
time, and the other methods outperform boosting in
terms of training time.
7 Conclusion and Future Work
In this paper, we investigated how different objec-
tive functions and optimization methods affect the
accuracy of the sequence labelling task in the dis-
criminative learning framework. Our experiments
show that optimizing different objective functions
does not have a large effect on the accuracy. Ex-
tending the feature space is more effective. We con-
clude that methods that can use a large, possibly infi-
nite, number of features may be advantageous over
others. We are running experiments that use a
dual formulation of the perceptron algorithm, which
has the property of being able to use infinitely many
features. Our future work includes using SVMs for
the label sequence learning task.

References
Y. Altun, T. Hofmann, and M. Johnson. 2002. Discriminative
learning for label sequences via boosting. In Proceedings of
NIPS*15.
Daniel M. Bikel, Richard L. Schwartz, and Ralph M.
Weischedel. 1999. An algorithm that learns what’s in a
name. Machine Learning, 34(1-3):211–231.
M. Collins. 2000. Discriminative reranking for natural lan-
guage parsing. In Proceedings of ICML 2000.
M. Collins. 2002. Ranking algorithms for named-entity extrac-
tion: Boosting and the voted perceptron. In Proceedings of
ACL’02.
M. Johnson, S. Geman, S. Canon, Z. Chi, and S. Riezler. 1999.
Estimators for stochastic unification-based grammars. In
Proceedings of ACL’99.
S. Kakade, Y.W. Teh, and S. Roweis. 2002. An alternative
objective function for Markovian fields. In Proceedings of
ICML 2002.
Dan Klein and Christopher D. Manning. 2002. Conditional
structure versus conditional estimation in nlp models. In
Proceedings of EMNLP 2002.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional
random fields: Probabilistic models for segmenting and la-
beling sequence data. In Proceedings of ICML2001.
A. McCallum, D. Freitag, and F. Pereira. 2000. Maximum En-
tropy Markov Models for Information Extraction and Seg-
mentation. In Proceedings of ICML 2000.
T. Minka. 2001. Algorithms for maximum-likelihood logistic
regression. Technical report, CMU, Department of Statis-
tics, TR 758.
V. Punyakanok and D. Roth. 2000. The use of classifiers in
sequential inference. In Proceedings of NIPS*13.
Adwait Ratnaparkhi. 1999. Learning to parse natural language
with maximum entropy models. Machine Learning, 34(1-
3):151–175.
R. Schapire and Y. Singer. 1999. Improved boosting algo-
rithms using confidence-rated predictions. Machine Learn-
ing, 37(3):297–336.
Kristina Toutanova and Christopher Manning. 2000. Enrich-
ing the knowledge sources used in a maximum entropy pos
tagger. In Proceedings of EMNLP 2000.
