Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 209–216,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Semi-Supervised Conditional Random Fields for Improved Sequence
Segmentation and Labeling
Feng Jiao
University of Waterloo
Shaojun Wang Chi-Hoon Lee
Russell Greiner Dale Schuurmans
University of Alberta
Abstract
We present a new semi-supervised training
procedure for conditional random fields
(CRFs) that can be used to train sequence
segmentors and labelers from a combina-
tion of labeled and unlabeled training data.
Our approach is based on extending the
minimum entropy regularization frame-
work to the structured prediction case,
yielding a training objective that combines
unlabeled conditional entropy with labeled
conditional likelihood. Although the train-
ing objective is no longer concave, it can
still be used to improve an initial model
(e.g. obtained from supervised training)
by iterative ascent. We apply our new
training algorithm to the problem of iden-
tifying gene and protein mentions in bio-
logical texts, and show that incorporating
unlabeled data improves the performance
of the supervised CRF in this case.
1 Introduction
Semi-supervised learning is often touted as one
of the most natural forms of training for language
processing tasks, since unlabeled data is so plen-
tiful whereas labeled data is usually quite limited
or expensive to obtain. The attractiveness of semi-
supervised learning for language tasks is further
heightened by the fact that the models learned are
large and complex, and generally even thousands
of labeled examples can only sparsely cover the
parameter space. Moreover, in complex structured
prediction tasks, such as parsing or sequence mod-
eling (part-of-speech tagging, word segmentation,
named entity recognition, and so on), it is con-
siderably more difficult to obtain labeled training
data than for classification tasks (such as docu-
ment classification), since hand-labeling individ-
ual words and word boundaries is much harder
than assigning text-level class labels.
Many approaches have been proposed for semi-
supervised learning in the past, including: genera-
tive models (Castelli and Cover 1996; Cohen and
Cozman 2006; Nigam et al. 2000), self-learning
(Celeux and Govaert 1992; Yarowsky 1995), co-
training (Blum and Mitchell 1998), information-
theoretic regularization (Corduneanu and Jaakkola
2006; Grandvalet and Bengio 2004), and graph-
based transductive methods (Zhou et al. 2004;
Zhou et al. 2005; Zhu et al. 2003). Unfortu-
nately, these techniques have been developed pri-
marily for single class label classification prob-
lems, or class label classification with a struc-
tured input (Zhou et al. 2004; Zhou et al. 2005;
Zhu et al. 2003). Although still highly desirable,
semi-supervised learning for structured classifica-
tion problems like sequence segmentation and la-
beling has not been as widely studied as the
other semi-supervised settings mentioned above,
with the sole exception of generative models.
With generative models, it is natural to include
unlabeled data using an expectation-maximization
approach (Nigam et al. 2000). However, gener-
ative models generally do not achieve the same
accuracy as discriminatively trained models, and
therefore it is preferable to focus on discriminative
approaches. Unfortunately, it is far from obvious
how unlabeled training data can be naturally in-
corporated into a discriminative training criterion.
For example, unlabeled data simply cancels from
the objective if one attempts to use a traditional
conditional likelihood criterion. Nevertheless, re-
cent progress has been made on incorporating un-
labeled data in discriminative training procedures.
For example, dependencies can be introduced be-
tween the labels of nearby instances and thereby
have an effect on training (Zhu et al. 2003; Li and
McCallum 2005; Altun et al. 2005). These models
are trained to encourage nearby data points to have
the same class label, and they can obtain impres-
sive accuracy using a very small amount of labeled
data. However, since they model pairwise similar-
ities among data points, most of these approaches
require joint inference over the whole data set at
test time, which is not practical for large data sets.
In this paper, we propose a new semi-supervised
training method for conditional random fields
(CRFs) that incorporates both labeled and unla-
beled sequence data to estimate a discriminative
structured predictor. CRFs are a flexible and pow-
erful model for structured predictors based on
undirected graphical models that have been glob-
ally conditioned on a set of input covariates (Laf-
ferty et al. 2001). CRFs have proved to be partic-
ularly useful for sequence segmentation and label-
ing tasks, since, as conditional models of the la-
bels given inputs, they relax the independence as-
sumptions made by traditional generative models
like hidden Markov models. As such, CRFs pro-
vide additional flexibility for using arbitrary over-
lapping features of the input sequence to define a
structured conditional model over the output se-
quence, while maintaining two advantages: first,
efficient dynamic programming can be used for infer-
ence in both classification and training, and sec-
ond, the training objective is concave in the model
parameters, which permits global optimization.
To obtain a new semi-supervised training algo-
rithm for CRFs, we extend the minimum entropy
regularization framework of Grandvalet and Ben-
gio (2004) to structured predictors. The result-
ing objective combines the likelihood of the CRF
on labeled training data with its conditional en-
tropy on unlabeled training data. Unfortunately,
the maximization objective is no longer concave,
but we can still use it to effectively improve an
initial supervised model. To develop an effective
training procedure, we first show how the deriva-
tive of the new objective can be computed from
the covariance matrix of the features on the unla-
beled data (combined with the labeled conditional
likelihood). This relationship facilitates the devel-
opment of an efficient dynamic programming procedure for
computing the gradient, and thereby allows us to
perform efficient iterative ascent for training. We
apply our new training technique to the problem of
sequence labeling and segmentation, and demon-
strate it specifically on the problem of identify-
ing gene and protein mentions in biological texts.
Our results show the advantage of semi-supervised
learning over the standard supervised algorithm.
2 Semi-supervised CRF training
In what follows, we use the same notation as (Laf-
ferty et al. 2001). Let X be a random variable over
data sequences to be labeled, and Y be a random
variable over corresponding label sequences. All
components, Y_i, of Y are assumed to range over
a finite label alphabet \mathcal{Y}. For example, X might
range over sentences and Y over part-of-speech
taggings of those sentences; hence \mathcal{Y} would be the
set of possible part-of-speech tags in this case.
Assume we have a set of labeled examples,
D^l = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}, and unla-
beled examples, D^u = \{x^{(N+1)}, \ldots, x^{(M)}\}. We
would like to build a CRF model

p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_{k=1}^{K} \lambda_k f_k(x, y)\Big) = \frac{1}{Z_\lambda(x)} \exp\big(\langle \lambda, f(x, y) \rangle\big)

over sequential input and output data x, y, where
\lambda = (\lambda_1, \ldots, \lambda_K)^\top, f(x, y) = (f_1(x, y), \ldots, f_K(x, y))^\top and

Z_\lambda(x) = \sum_{y} \exp\big(\langle \lambda, f(x, y) \rangle\big)

Our goal is to learn such a model from the com-
bined set of labeled and unlabeled examples, D^l \cup D^u.
The standard supervised CRF training proce-
dure is based upon maximizing the log conditional
likelihood of the labeled examples in D^l

CL(\lambda) = \sum_{i=1}^{N} \log p_\lambda\big(y^{(i)} \,\big|\, x^{(i)}\big) - U(\lambda)    (1)

where U(\lambda) is any standard regularizer on \lambda, e.g.
U(\lambda) = \|\lambda\|^2/2. Regularization can be used to
limit over-fitting on rare features and avoid degen-
eracy in the case of correlated features. Obviously,
(1) ignores the unlabeled examples in D^u.
To make full use of the available training data,
we propose a semi-supervised learning algorithm
that exploits a form of entropy regularization on
the unlabeled data. Specifically, for a semi-
supervised CRF, we propose to maximize the fol-
lowing objective
RL(\lambda) = \sum_{i=1}^{N} \log p_\lambda\big(y^{(i)} \,\big|\, x^{(i)}\big) - U(\lambda)    (2)
           + \gamma \sum_{i=N+1}^{M} \sum_{y} p_\lambda\big(y \,\big|\, x^{(i)}\big) \log p_\lambda\big(y \,\big|\, x^{(i)}\big)
where the first term is the penalized log condi-
tional likelihood of the labeled data under the
CRF, (1), and the second line is the negative con-
ditional entropy of the CRF on the unlabeled data.
Here, a102 is a tradeoff parameter that controls the
influence of the unlabeled data.
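To make the objective concrete, here is a minimal sketch (ours, not the authors' implementation) that evaluates (2) by brute-force enumeration of label sequences for a tiny chain CRF. The label alphabet, feature map and helper names are hypothetical, and enumeration is only feasible for very short toy sequences; real CRF training replaces it with dynamic programming.

    import itertools
    import numpy as np

    LABELS = [0, 1]                                # toy label alphabet; observations are also 0/1

    def features(x, y):
        # Hypothetical feature map: state-observation and transition counts.
        f = np.zeros(len(LABELS) * 2 + len(LABELS) ** 2)
        for t, (xt, yt) in enumerate(zip(x, y)):
            f[yt * 2 + xt] += 1.0                  # state-observation features
            if t > 0:
                f[len(LABELS) * 2 + y[t - 1] * len(LABELS) + yt] += 1.0   # transition features
        return f

    def log_p(lam, x, y):
        # log p_lambda(y | x), enumerating all label sequences (exponential; toy only).
        scores = {yy: lam @ features(x, list(yy))
                  for yy in itertools.product(LABELS, repeat=len(x))}
        log_Z = np.logaddexp.reduce(list(scores.values()))
        return scores[tuple(y)] - log_Z

    def objective(lam, labeled, unlabeled, gamma):
        # RL(lambda): penalized labeled log-likelihood plus gamma * sum_y p log p on unlabeled data.
        ll = sum(log_p(lam, x, y) for x, y in labeled) - 0.5 * (lam @ lam)   # U(lambda) = ||lambda||^2 / 2
        neg_ent = 0.0
        for x in unlabeled:
            for yy in itertools.product(LABELS, repeat=len(x)):
                lp = log_p(lam, x, list(yy))
                neg_ent += np.exp(lp) * lp
        return ll + gamma * neg_ent

With gamma = 0 this reduces to the supervised objective (1).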
This approach resembles that taken by (Grand-
valet and Bengio 2004) for single variable classi-
fication, but here applied to structured CRF train-
ing. The motivation is that minimizing conditional
entropy over unlabeled data encourages the algo-
rithm to find putative labelings for the unlabeled
data that are mutually reinforcing with the super-
vised labels; that is, greater certainty on the pu-
tative labelings coincides with greater conditional
likelihood on the supervised labels, and vice versa.
For a single classification variable this criterion
has been shown to effectively partition unlabeled
data into clusters (Grandvalet and Bengio 2004;
Roberts et al. 2000).
To motivate the approach in more detail, con-
sider the overlap between the probability distribu-
tion over a label sequence, p_\lambda(y \mid x), and the empirical
distribution \tilde p(x) on the unlabeled data D^u. The
overlap can be measured by the Kullback-Leibler
divergence D(p_\lambda(y \mid x)\tilde p(x) \,\|\, \tilde p(x)). It is well
known that Kullback-Leibler divergence (Cover
and Thomas 1991) is positive and increases as the
overlap between the two distributions decreases.
In other words, maximizing Kullback-Leibler di-
vergence implies that the overlap between two dis-
tributions is minimized. The total overlap over all
possible label sequences can be defined as

\sum_{y} D\big(p_\lambda(y \mid x)\tilde p(x) \,\big\|\, \tilde p(x)\big)
 = \sum_{y} \sum_{x} p_\lambda(y \mid x)\tilde p(x) \log \frac{p_\lambda(y \mid x)\tilde p(x)}{\tilde p(x)}
 = \sum_{x} \tilde p(x) \sum_{y} p_\lambda(y \mid x) \log p_\lambda(y \mid x)
which motivates the negative entropy term in (2).
The combined training objective (2) exploits
unlabeled data to improve the CRF model, as
we will see. However, one drawback with this
approach is that the entropy regularization term
is not concave. To see why, note that the en-
tropy regularizer can be seen as a composition,
h(\lambda) = f(g(\lambda)), where f(g) = \sum_{y} g_y \log g_y and, for
each label sequence y, g_y : \mathbb{R}^K \to \mathbb{R} is given by
g_y(\lambda) = \frac{1}{Z_\lambda(x)} \exp\big(\sum_{k=1}^{K} \lambda_k f_k(x, y)\big). For scalar \lambda, the
second derivative of a composition, h = f \circ g, is
given by (Boyd and Vandenberghe 2004)

h''(\lambda) = g'(\lambda)^\top \nabla^2 f\big(g(\lambda)\big)\, g'(\lambda) + \nabla f\big(g(\lambda)\big)^\top g''(\lambda)

Although f and the g_y are concave here, since f is not
nondecreasing, h is not necessarily concave. So in
general there are local maxima in (2).
3 An efficient training procedure
As (2) is not concave, many of the standard global
maximization techniques do not apply. However,
one can still use unlabeled data to improve a su-
pervised CRF via iterative ascent. To derive an ef-
ficient iterative ascent procedure, we need to compute
the gradient of (2) with respect to the parameters
\lambda. Taking the derivative of the objective function (2)
with respect to \lambda yields (see Appendix A for the derivation)

\frac{\partial}{\partial\lambda} RL(\lambda) = \sum_{i=1}^{N} \Big( f\big(x^{(i)}, y^{(i)}\big) - \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f\big(x^{(i)}, y\big) \Big) - \frac{\partial}{\partial\lambda} U(\lambda)    (3)
 + \gamma \sum_{i=N+1}^{M} \mathrm{cov}_{p_\lambda(y \mid x^{(i)})}\big[f\big(x^{(i)}, y\big)\big]\, \lambda
The first three items on the right hand side are
just the standard gradient of the CRF objective,
\partial CL(\lambda)/\partial\lambda (Lafferty et al. 2001), and the final
item is the gradient of the entropy regularizer (the
derivation of which is given in Appendix A).
Here, \mathrm{cov}_{p_\lambda(y \mid x^{(i)})}\big[f(x^{(i)}, y)\big] is the condi-
tional covariance matrix of the features, f_j(x, y),
given sample sequence x^{(i)}. In particular, the
(j, k)th element of this matrix is given by

\mathrm{cov}_{p_\lambda(y \mid x)}\big[f_j(x, y),\, f_k(x, y)\big]
 = E_{p_\lambda(y \mid x)}\big[f_j(x, y) f_k(x, y)\big] - E_{p_\lambda(y \mid x)}\big[f_j(x, y)\big]\, E_{p_\lambda(y \mid x)}\big[f_k(x, y)\big]    (4)
 = \sum_{y} p_\lambda(y \mid x) f_j(x, y) f_k(x, y) - \Big(\sum_{y} p_\lambda(y \mid x) f_j(x, y)\Big)\Big(\sum_{y} p_\lambda(y \mid x) f_k(x, y)\Big)
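As a sanity check on (4), the conditional covariance for a fixed input can be computed by direct enumeration, reusing the hypothetical toy helpers from the sketch in Section 2 (in that toy the features are position-summed counts rather than single indicators, so the diagonal shortcut discussed below does not apply verbatim).

    def feature_covariance(lam, x):
        # cov_{p_lambda(y|x)}[f(x, y)] by brute-force enumeration; a direct transcription of (4).
        ys = list(itertools.product(LABELS, repeat=len(x)))
        F = np.array([features(x, list(yy)) for yy in ys])    # one feature vector per label sequence
        p = np.exp([log_p(lam, x, list(yy)) for yy in ys])    # p_lambda(y | x) for every y
        mean = p @ F                                          # E[f]
        second = F.T @ (p[:, None] * F)                       # E[f f^T]
        return second - np.outer(mean, mean)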
To efficiently calculate the gradient, we need
to be able to efficiently compute the expectations
with respect to y in (3) and (4). However, this
can pose a challenge in general, because there are
exponentially many values for y. Techniques for
computing the linear feature expectations in (3)
are already well known if y is sufficiently struc-
tured (e.g. y forms a Markov chain) (Lafferty et
al. 2001). However, we now have to develop effi-
cient techniques for computing the quadratic fea-
ture expectations in (4).
For the quadratic feature expectations, first note
that the diagonal terms, j = k, are straightfor-
ward: since each feature is an indicator, we have
f_j(x, y)^2 = f_j(x, y), and therefore the diag-
onal terms in the conditional covariance are just
linear feature expectations,

E_{p_\lambda(y \mid x)}\big[f_j(x, y)^2\big] = E_{p_\lambda(y \mid x)}\big[f_j(x, y)\big]

as before.
For the off-diagonal terms, j \neq k, however,
we need to develop a new algorithm. Fortunately,
for structured label sequences, Y, one can devise
an efficient algorithm for calculating the quadratic
expectations based on nested dynamic program-
ming. To illustrate the idea, we assume that the
dependencies of Y, conditioned on X, form a
Markov chain.
Define one feature for each state pair (y', y),
and one feature for each state-observation pair
(y, x), which we express with indicator functions

f_{\langle y', y \rangle}\big(\langle u, v \rangle, y|_{\langle u, v \rangle}, x\big) = \delta(y_u, y')\, \delta(y_v, y)
g_{\langle y, x \rangle}\big(v, y|_{v}, x\big) = \delta(y_v, y)\, \delta(x_v, x)

respectively.
Following (Lafferty et al. 2001), we also add spe-
cial start and stop states, Y_0 = \mathrm{start} and Y_{n+1} =
\mathrm{stop}. The conditional probability of a label se-
quence can now be expressed concisely in a ma-
trix form. For each position i in the observation
sequence x, define the |\mathcal{Y}| \times |\mathcal{Y}| matrix random
variable M_i(x) = \big[M_i(y', y \mid x)\big] by

M_i(y', y \mid x) = \exp\big(\Lambda_i(y', y \mid x)\big)    where
\Lambda_i(y', y \mid x) = \sum_{k} \lambda_k f_k\big(e_i, y|_{e_i} = (y', y), x\big) + \sum_{k} \mu_k g_k\big(v_i, y|_{v_i} = y, x\big)

Here e_i is the edge with labels (Y_{i-1}, Y_i) and v_i
is the vertex with label Y_i.

For each index i = 0, \ldots, n+1 define the for-
ward vectors \alpha_i(x) with base case

\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \mathrm{start} \\ 0 & \text{otherwise} \end{cases}

and recurrence

\alpha_i(x) = \alpha_{i-1}(x)\, M_i(x)

Similarly, the backward vectors \beta_i(x) are given by

\beta_{n+1}(y \mid x) = \begin{cases} 1 & \text{if } y = \mathrm{stop} \\ 0 & \text{otherwise} \end{cases}
\beta_i(x) = M_{i+1}(x)\, \beta_{i+1}(x)
With these definitions, the expectation of
the product of each pair of feature func-
tions, \big(f_j(x,y), f_k(x,y)\big), \big(f_j(x,y), g_k(x,y)\big),
and \big(g_j(x,y), g_k(x,y)\big), for j, k = 1, \ldots, K,
j \neq k, can be recursively calculated.
First define the summary matrix

M_{a+1, b-1}(y, y' \mid x) = \Big[\prod_{c=a+1}^{b-1} M_c(x)\Big]_{y, y'}

Then the quadratic feature expectations can be
computed by the following recursion, where the
two double sums in each expectation correspond
to the two cases depending on which feature oc-
curs first (e_a occurring before e_b).
E_{p_\lambda(y \mid x)}\big[f_j(x, y) f_k(x, y)\big]
 = \frac{1}{Z_\lambda(x)} \sum_{1 \le a < b \le n+1} \sum_{(y', y)} f_j\big(e_a, y|_{e_a} = (y', y), x\big) \sum_{(y'', y''')} f_k\big(e_b, y|_{e_b} = (y'', y'''), x\big)
   \;\alpha_{a-1}(y' \mid x)\, M_a(y', y \mid x)\, M_{a+1, b-1}(y, y'' \mid x)\, M_b(y'', y''' \mid x)\, \beta_b(y''' \mid x)
 + \frac{1}{Z_\lambda(x)} \sum_{1 \le b < a \le n+1} \sum_{(y', y)} f_j\big(e_a, y|_{e_a} = (y', y), x\big) \sum_{(y'', y''')} f_k\big(e_b, y|_{e_b} = (y'', y'''), x\big)
   \;\alpha_{b-1}(y'' \mid x)\, M_b(y'', y''' \mid x)\, M_{b+1, a-1}(y''', y' \mid x)\, M_a(y', y \mid x)\, \beta_a(y \mid x)

The mixed expectations E_{p_\lambda(y \mid x)}\big[f_j(x, y) g_k(x, y)\big] and the vertex-vertex expectations
E_{p_\lambda(y \mid x)}\big[g_j(x, y) g_k(x, y)\big] are computed by recursions of exactly the same form: the indicator
for a vertex feature g at vertex v_b selects a single label y_b rather than an edge pair, so the explicit
matrix factor at that position is absorbed into the adjacent summary product, which then runs up to
position b.
The computation of these expectations can be or-
ganized in a trellis, as illustrated in Figure 1.
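The sketch below is our simplified illustration (not the authors' code) of the quantities involved: forward vectors, backward vectors, summary matrices, and the joint probability of two edges that each term of the quadratic feature expectations accumulates. Positions are 0-based, the start and stop states are folded into the initial score vector and the final sum, the names are hypothetical, and log-space computation (needed for numerical stability on long sequences) is omitted.

    import numpy as np

    def forward_backward(M, pi):
        # M : list of n-1 arrays; M[i][u, v] = exp(score of the transition u -> v into position i+1)
        # pi: length-S array; pi[v] = exp(score of label v at position 0)
        n, S = len(M) + 1, len(pi)
        alpha = [pi]                              # alpha[i][v]: mass of label prefixes ending in v at position i
        for i in range(n - 1):
            alpha.append(alpha[-1] @ M[i])
        beta = [np.ones(S)]                       # beta[i][v]: mass of label suffixes starting in v at position i
        for i in reversed(range(n - 1)):
            beta.insert(0, M[i] @ beta[0])
        Z = alpha[-1].sum()                       # partition function Z_lambda(x)
        return alpha, beta, Z

    def summary(M, a, b):
        # Product of the score matrices strictly between edge a and edge b (identity when b = a + 1).
        out = np.eye(M[0].shape[0])
        for i in range(a, b - 1):
            out = out @ M[i]
        return out

    def pairwise_edge_prob(M, pi, a, b, y1, y2, y3, y4):
        # p(y_{a-1}=y1, y_a=y2, y_{b-1}=y3, y_b=y4 | x) for edge positions 1 <= a < b <= n-1.
        alpha, beta, Z = forward_backward(M, pi)
        return (alpha[a - 1][y1] * M[a - 1][y1, y2] * summary(M, a, b)[y2, y3]
                * M[b - 1][y3, y4] * beta[b][y4]) / Z

Summing this quantity against the indicator features at the two edges, over all valid position pairs, reproduces the double sums above; caching the alphas, betas and partial summary products is what keeps the nested dynamic program polynomial.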
Once we obtain the gradient of the objective
function (2), we use limited-memory L-BFGS, a
quasi-Newton optimization algorithm (McCallum
2002; Nocedal and Wright 2000), to find a local
maximum, with the initial value set to the
optimal solution of the supervised CRF on the labeled
data.
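A minimal sketch of this two-stage strategy, using SciPy's L-BFGS-B in place of the MALLET optimizer used by the authors; supervised_neg_objective and semi_supervised_neg_objective are hypothetical functions that would return the negated objectives (1) and (2) together with their gradients, since SciPy minimizes.

    import numpy as np
    from scipy.optimize import minimize

    def train_semi_supervised_crf(labeled, unlabeled, gamma, dim):
        # Stage 1: supervised CRF training; objective (1) is concave, so this is a global optimum.
        sup = minimize(lambda lam: supervised_neg_objective(lam, labeled),
                       np.zeros(dim), jac=True, method="L-BFGS-B")
        # Stage 2: ascend the non-concave objective (2) starting from the supervised optimum;
        # L-BFGS then converges to a local maximum that improves on the initial model.
        semi = minimize(lambda lam: semi_supervised_neg_objective(lam, labeled, unlabeled, gamma),
                        sup.x, jac=True, method="L-BFGS-B")
        return semi.x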
4 Time and space complexity
The time and space complexity of the semi-
supervised CRF training procedure is greater
than that of standard supervised CRF training,
but nevertheless remains a small-degree poly-
nomial in the size of the training data. Let

N_l = size of the labeled set
N_u = size of the unlabeled set
n_l = labeled sequence length
n_u = unlabeled sequence length
n_t = test sequence length
s = number of states
k = number of training iterations.

Then the time required to classify a test sequence
is O(n_t s^2), independent of training method, since
the Viterbi decoder needs to access each path.
For training, supervised CRF training requires
O(k N_l n_l s^2) time, whereas semi-supervised CRF
training requires O(k N_l n_l s^2 + k N_u n_u^2 s^3) time.
The additional cost for semi-supervised training
arises from the extra nested loop required to cal-
culate the quadratic feature expectations, which
introduces an additional n_u s factor.
However, the space requirements of the two
training methods are the same. That is, even
though the covariance matrix has size O(K^2),
there is never any need to store the entire matrix in
memory. Rather, since we only need to compute
the product of the covariance with \lambda, the calcu-
lation can be performed iteratively without using
extra space beyond that already required by super-
vised CRF training.
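The space-saving observation can be made concrete with the brute-force toy helpers from the earlier sketches (again hypothetical; a real implementation performs the same contraction inside the dynamic program): the product cov[f] lambda in (3) only needs feature-vector dot products, never the full K x K matrix.

    def cov_times_lambda(lam, x):
        # Compute cov_{p_lambda(y|x)}[f] @ lambda without forming the K x K covariance matrix.
        ys = list(itertools.product(LABELS, repeat=len(x)))
        F = np.array([features(x, list(yy)) for yy in ys])
        p = np.exp([log_p(lam, x, list(yy)) for yy in ys])
        s = F @ lam                                    # <lambda, f(x, y)> for every label sequence
        mean_f = p @ F                                 # E[f]
        return F.T @ (p * s) - mean_f * float(p @ s)   # E[f <lambda, f>] - E[f] <lambda, E[f]>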
Figure 1: Trellis for computing the expectation of a feature
product over a pair of feature functions, f_j vs f_k, where the
feature f_j occurs first. This leads to one double sum.
5 Identifying gene and protein mentions
We have developed our new semi-supervised
training procedure to address the problem of infor-
mation extraction from biomedical text, which has
received significant attention in the past few years.
We have specifically focused on the problem of
identifying explicit mentions of gene and protein
names (McDonald and Pereira 2005). Recently,
McDonald and Pereira (2005) have obtained inter-
esting results on this problem by using a standard
supervised CRF approach. However, our con-
tention is that stronger results could be obtained
in this domain by exploiting a large corpus of un-
annotated biomedical text to improve the quality
of the predictions, which we now show.
Given a biomedical text, the task of identify-
ing gene mentions can be interpreted as a tagging
task, where each word in the text can be labeled
with a tag that indicates whether it is the beginning
of a gene mention (B), the continuation of a gene
mention (I), or outside of any gene mention (O).
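For instance (an invented fragment, not drawn from the corpus), a tokenized sentence and its tags might look as follows, with "p53 tumor suppressor protein" marked as a single mention:

    tokens = ["Thus", "the", "p53", "tumor", "suppressor", "protein", "binds", "DNA"]
    tags   = ["O",    "O",   "B",   "I",     "I",          "I",       "O",     "O"]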
To compare the performance of different taggers
learned by different mechanisms, one can measure
the precision, recall and F-measure, given by
precision = (# correct predictions) / (# predicted gene mentions)
recall = (# correct predictions) / (# true gene mentions)
F-measure = (2 × precision × recall) / (precision + recall)
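A small helper that applies these formulas; as a check, the supervised-CRF counts reported later in this section (3334 predicted mentions, 2435 of them correct, against 7472 true mentions) give the stated values.

    def prf(correct, predicted, true_mentions):
        precision = correct / predicted
        recall = correct / true_mentions
        f_measure = 2 * precision * recall / (precision + recall)
        return precision, recall, f_measure

    # prf(2435, 3334, 7472) -> approximately (0.73, 0.33, 0.45)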
In our evaluation, we compared the proposed
semi-supervised learning approach to the state of
the art supervised CRF of McDonald and Pereira
(2005), and also to self-training (Celeux and Gov-
aert 1992; Yarowsky 1995), using the same fea-
ture set as (McDonald and Pereira 2005). The
CRF training procedures, supervised and semi-
supervised, were run with the same regularization
function, U(\lambda) = \|\lambda\|^2/2, used in (McDonald and
Pereira 2005).
First we evaluated the performance of the semi-
supervised CRF in detail, by varying the ratio be-
tween the amount of labeled and unlabeled data,
and also varying the tradeoff parameter γ. We
chose a labeled training set A consisting of 5448
words, and considered alternative unlabeled train-
ing sets, B (5210 words), C (10,208 words), and
D (25,145 words), consisting of the same, 2 times
and 5 times as many sentences as A respectively.
All of these sets were disjoint and selected ran-
domly from the full corpus, the smaller one in
(McDonald et al. 2005), consisting of 184,903
words in total. To determine sensitivity to the pa-
rameter γ we examined a range of discrete values
0, 0.1, 0.5, 1, 5, 10, 20, 50.
In our first experiment, we trained the CRF models
using labeled set A and unlabeled sets B, C and
D respectively, and then tested performance on the
sets B, C and D respectively. The results of our
evaluation are shown in Table 1. The performance
of the supervised CRF algorithm, trained only on
the labeled set A, is given in the first row of Table
1 (corresponding to γ = 0). By comparison, the
results obtained by the semi-supervised CRFs on
the held-out sets B, C and D are given in Table 1
by increasing the value of γ.
The results of this experiment demonstrate quite
clearly that in most cases the semi-supervised CRF
obtains higher precision, recall and F-measure
than the fully supervised CRF, yielding a 20% im-
provement in the best case.
In our second experiment, we again trained the
CRF models using labeled set A and unlabeled
sets B, C and D respectively, with increasing val-
ues of γ, but tested performance on the held-
out set E, which is the full corpus minus the la-
beled set A and unlabeled sets B, C and D. The
results of our evaluation are shown in Table 2 and
Figure 2. The blue line in Figure 2 is the result
of the supervised CRF algorithm, trained only on
the labeled set A. In particular, by using the super-
vised CRF model, the system predicted 3334 out
of 7472 gene mentions, of which 2435 were cor-
rect, resulting in a precision of 0.73, recall of 0.33
and F-measure of 0.45. The other curves are those
of the semi-supervised CRFs.
[Figure 2 plots the number of correct predictions (true positives) against the tradeoff parameter γ
(0.1 to 20) for the semi-supervised CRFs trained with the unlabeled sets B, C and D, together with the
supervised CRF baseline.]
Figure 2: Performance of the supervised and semi-
supervised CRFs. The sets B, C and D refer to the unlabeled
training set used by the semi-supervised algorithm.

The results of this experiment demonstrate quite
clearly that the semi-supervised CRFs simultane-
ously increase both the number of predicted gene
mentions and the number of correct predictions,
thus the precision remains almost the same as the
supervised CRF, and the recall increases signifi-
cantly.
Both experiments as illustrated in Figure 2 and
Tables 1 and 2 show that clearly better results
are obtained by incorporating additional unlabeled
training data, even when evaluating on disjoint
testing data (Figure 2). The performance of the
semi-supervised CRF is not overly sensitive to the
tradeoff parameter a102 , except that a102 cannot be set
too large.
5.1 Comparison to self-training
For completeness, we also compared our results to
the self-learning algorithm, which has commonly
been referred to as bootstrapping in natural lan-
guage processing, and was originally popularized by
the work of Yarowsky in word sense disambigua-
tion (Abney 2004; Yarowsky 1995). In fact, sim-
ilar ideas have been developed in pattern recogni-
tion under the name of the decision-directed algo-
rithm (Duda and Hart 1973), and can also be traced back
to the 1970s in the EM literature (Celeux and Govaert
1992). The basic algorithm works as follows:
1. Given D^l and D^u, begin with a seed set of
labeled examples, D^{(0)}, chosen from D^l.

2. For t = 0, 1, \ldots

(a) Train the supervised CRF on the labeled ex-
amples D^{(t)}, obtaining \lambda^{(t)}.

(b) For each sequence x^{(i)} \in D^u, find
y^{(i)}_{(t)} = \arg\max_{y} p_{\lambda^{(t)}}\big(y \mid x^{(i)}\big)
via Viterbi decoding or other inference al-
gorithm, and add the pair \big(x^{(i)}, y^{(i)}_{(t)}\big) to
the set of labeled examples (replacing
any previous label for x^{(i)} if present).

(c) If for each x^{(i)} \in D^u, y^{(i)}_{(t)} = y^{(i)}_{(t-1)},
stop; otherwise set t = t + 1 and iterate.

Table 1: Performance of the semi-supervised CRFs obtained on the held-out sets B, C and D
      | Test Set B, Trained on A and B | Test Set C, Trained on A and C | Test Set D, Trained on A and D
γ     | Precision Recall F-Measure     | Precision Recall F-Measure     | Precision Recall F-Measure
0     | 0.80 0.36 0.50                 | 0.77 0.29 0.43                 | 0.74 0.30 0.43
0.1   | 0.82 0.4  0.54                 | 0.79 0.32 0.46                 | 0.74 0.31 0.44
0.5   | 0.82 0.4  0.54                 | 0.79 0.33 0.46                 | 0.74 0.31 0.44
1     | 0.82 0.4  0.54                 | 0.77 0.34 0.47                 | 0.73 0.33 0.45
5     | 0.84 0.45 0.59                 | 0.78 0.38 0.51                 | 0.72 0.36 0.48
10    | 0.78 0.46 0.58                 | 0.66 0.38 0.48                 | 0.66 0.38 0.47

Table 2: Performance of the semi-supervised CRFs trained by using unlabeled sets B, C and D
      | Test Set E, Trained on A and B | Test Set E, Trained on A and C | Test Set E, Trained on A and D
γ     | # predicted  # correct         | # predicted  # correct         | # predicted  # correct
0.1   | 3345  2446                     | 3376  2470                     | 3366  2466
0.5   | 3413  2489                     | 3450  2510                     | 3376  2469
1     | 3446  2503                     | 3588  2580                     | 3607  2590
5     | 4089  2878                     | 4206  2947                     | 4165  2888
10    | 4450  2799                     | 4762  2827                     | 4778  2845
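A compact sketch of the self-training loop above (steps (a)-(c)); train_crf and viterbi_decode are hypothetical stand-ins for supervised CRF training and Viterbi inference.

    def self_train(seed_labeled, unlabeled_x, max_iters=50):
        labeled = list(seed_labeled)
        prev = None
        lam = None
        for t in range(max_iters):
            lam = train_crf(labeled)                                  # step (a)
            current = [viterbi_decode(lam, x) for x in unlabeled_x]   # step (b): label the unlabeled sequences
            labeled = list(seed_labeled) + list(zip(unlabeled_x, current))
            if current == prev:                                       # step (c): labels unchanged, so stop
                break
            prev = current
        return lam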
We implemented this self-training approach and
tried it in our experiments. Unfortunately, we
were not able to obtain any improvements over the
standard supervised CRF with self-learning, using
the sets D^l = A and D^u \in \{B, C, D\}. The
semi-supervised CRF remains the best of the ap-
proaches we have tried on this problem.
6 Conclusions and further directions
We have presented a new semi-supervised training
algorithm for CRFs, based on extending minimum
conditional entropy regularization to the struc-
tured prediction case. Our approach is motivated
by the information-theoretic argument (Grand-
valet and Bengio 2004; Roberts et al. 2000) that
unlabeled examples can provide the most bene-
fit when classes have small overlap. An itera-
tive ascent optimization procedure was developed
for this new criterion, which exploits a nested dy-
namic programming approach to efficiently com-
pute the covariance matrix of the features.
We applied our new approach to the problem of
identifying gene name occurrences in biological
text, exploiting the availability of auxiliary unla-
beled data to improve the performance of the state
of the art supervised CRF approach in this do-
main. Our semi-supervised CRF approach shares
all of the benefits of the standard CRF training,
including the ability to exploit arbitrary features
of the inputs, while obtaining improved accuracy
through the use of unlabeled data. The main draw-
back is that training time is increased because of
the extra nested loop needed to calculate feature
covariances. Nevertheless, the algorithm is suf-
ficiently efficient to be trained on unlabeled data
sets that yield a notable improvement in classifi-
cation accuracy over standard supervised training.
To further accelerate the training process of our
semi-supervised CRFs, we may apply the stochastic
gradient optimization method with adaptive gain
adjustment proposed by Vishwanathan et al.
(2006).
Acknowledgments
Research supported by Genome Alberta, Genome Canada,
and the Alberta Ingenuity Centre for Machine Learning.
References
S. Abney. (2004). Understanding the Yarowsky algorithm.
Computational Linguistics, 30(3):365-395.
Y. Altun, D. McAllester and M. Belkin. (2005). Maximum
margin semi-supervised learning for structured variables.
Advances in Neural Information Processing Systems 18.
A. Blum and T. Mitchell. (1998). Combining labeled and
unlabeled data with co-training. Proceedings of the Work-
shop on Computational Learning Theory, 92-100.
S. Boyd and L. Vandenberghe. (2004). Convex Optimization.
Cambridge University Press.
V. Castelli and T. Cover. (1996). The relative value of la-
beled and unlabeled samples in pattern recognition with
an unknown mixing parameter. IEEE Trans. on Informa-
tion Theory, 42(6):2102-2117.
G. Celeux and G. Govaert. (1992). A classification EM al-
gorithm for clustering and two stochastic versions. Com-
putational Statistics and Data Analysis, 14:315-332.
I. Cohen and F. Cozman. (2006). Risks of semi-supervised
learning. Semi-Supervised Learning, O. Chapelle, B.
Schölkopf and A. Zien (Editors), 55-70, MIT Press.
A. Corduneanu and T. Jaakkola. (2006). Data dependent
regularization. Semi-Supervised Learning, O. Chapelle,
B. Schölkopf and A. Zien (Editors), 163-182, MIT Press.
T. Cover and J. Thomas, (1991). Elements of Information
Theory, John Wiley & Sons.
R. Duda and P. Hart. (1973). Pattern Classification and
Scene Analysis, John Wiley & Sons.
Y. Grandvalet and Y. Bengio. (2004). Semi-supervised learn-
ing by entropy minimization, Advances in Neural Infor-
mation Processing Systems, 17:529-536.
J. Lafferty, A. McCallum and F. Pereira. (2001). Conditional
random fields: probabilistic models for segmenting and
labeling sequence data. Proceedings of the 18th Interna-
tional Conference on Machine Learning, 282-289.
W. Li and A. McCallum. (2005). Semi-supervised sequence
modeling with syntactic topic models. Proceedings of
Twentieth National Conference on Artificial Intelligence,
813-818.
A. McCallum. (2002). MALLET: A machine learning for
language toolkit. [http://mallet.cs.umass.edu]
R. McDonald, K. Lerman and Y. Jin. (2005). Con-
ditional random field biomedical entity tagger.
[http://www.seas.upenn.edu/~sryantm/software/BioTagger/]
R. McDonald and F. Pereira. (2005). Identifying gene and
protein mentions in text using conditional random fields.
BMC Bioinformatics 2005, 6(Suppl 1):S6.
K. Nigam, A. McCallum, S. Thrun and T. Mitchell. (2000).
Text classification from labeled and unlabeled documents
using EM. Machine learning. 39(2/3):135-167.
J. Nocedal and S. Wright. (2000). Numerical Optimization,
Springer.
S. Roberts, R. Everson and I. Rezek. (2000). Maximum cer-
tainty data partitioning. Pattern Recognition, 33(5):833-
839.
S. Vishwanathan, N. Schraudolph, M. Schmidt and K. Mur-
phy. (2006). Accelerated training of conditional random
fields with stochastic meta-descent. Proceedings of the
23rd International Conference on Machine Learning.
D. Yarowsky. (1995). Unsupervised word sense disambigua-
tion rivaling supervised methods. Proceedings of the 33rd
Annual Meeting of the Association for Computational Lin-
guistics, 189-196.
D. Zhou, O. Bousquet, T. Navin Lal, J. Weston and B.
Schölkopf. (2004). Learning with local and global con-
sistency. Advances in Neural Information Processing Sys-
tems, 16:321-328.
D. Zhou, J. Huang and B. Schölkopf. (2005). Learning from
labeled and unlabeled data on a directed graph. Proceed-
ings of the 22nd International Conference on Machine
Learning, 1041-1048.
X. Zhu, Z. Ghahramani and J. Lafferty. (2003). Semi-
supervised learning using Gaussian fields and harmonic
functions. Proceedings of the 20th International Confer-
ence on Machine Learning, 912-919.
A Deriving the gradient of the entropy
We wish to show that
\frac{\partial}{\partial\lambda}\Big(\sum_{i=N+1}^{M} \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) \log p_\lambda\big(y \mid x^{(i)}\big)\Big) = \sum_{i=N+1}^{M} \mathrm{cov}_{p_\lambda(y \mid x^{(i)})}\big[f\big(x^{(i)}, y\big)\big]\, \lambda    (5)

First, note that some simple calculation yields

\frac{\partial \log Z_\lambda\big(x^{(i)}\big)}{\partial \lambda_j} = \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big)

and

\frac{\partial p_\lambda\big(y \mid x^{(i)}\big)}{\partial \lambda_j}
 = \frac{\partial}{\partial \lambda_j} \Bigg( \frac{\exp\big\langle \lambda, f\big(x^{(i)}, y\big) \big\rangle}{Z_\lambda\big(x^{(i)}\big)} \Bigg)
 = p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big) - p_\lambda\big(y \mid x^{(i)}\big) \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big)

Therefore

\frac{\partial}{\partial \lambda_j} \Big( \sum_{i=N+1}^{M} \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) \log p_\lambda\big(y \mid x^{(i)}\big) \Big)
 = \sum_{i=N+1}^{M} \frac{\partial}{\partial \lambda_j} \Big( \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) \big\langle \lambda, f\big(x^{(i)}, y\big) \big\rangle - \log Z_\lambda\big(x^{(i)}\big) \Big)
 = \sum_{i=N+1}^{M} \Big( \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big) + \sum_{y} \frac{\partial p_\lambda\big(y \mid x^{(i)}\big)}{\partial \lambda_j} \big\langle \lambda, f\big(x^{(i)}, y\big) \big\rangle - \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big) \Big)
 = \sum_{i=N+1}^{M} \Big( \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big) \big\langle \lambda, f\big(x^{(i)}, y\big) \big\rangle - \Big( \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) \big\langle \lambda, f\big(x^{(i)}, y\big) \big\rangle \Big) \Big( \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big) \Big) \Big)
 = \sum_{i=N+1}^{M} \sum_{k} \lambda_k \Big( \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big) f_k\big(x^{(i)}, y\big) - \Big( \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_k\big(x^{(i)}, y\big) \Big) \Big( \sum_{y} p_\lambda\big(y \mid x^{(i)}\big) f_j\big(x^{(i)}, y\big) \Big) \Big)

In vector form, this can be written as (5).
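The identity (5) is easy to verify numerically on a toy model; the sketch below (hypothetical, reusing the brute-force helpers from the earlier sketches) compares a finite-difference gradient of the entropy term for one unlabeled sequence against the covariance-times-lambda expression.

    def entropy_term(lam, x):
        # sum_y p_lambda(y|x) log p_lambda(y|x) for one unlabeled sequence, by enumeration.
        ys = list(itertools.product(LABELS, repeat=len(x)))
        logp = np.array([log_p(lam, x, list(yy)) for yy in ys])
        return float(np.sum(np.exp(logp) * logp))

    def check_entropy_gradient(lam, x, eps=1e-6):
        analytic = feature_covariance(lam, x) @ lam    # right-hand side of (5) for a single sequence
        numeric = np.array([(entropy_term(lam + eps * e, x) - entropy_term(lam - eps * e, x)) / (2 * eps)
                            for e in np.eye(len(lam))])
        return np.max(np.abs(analytic - numeric))      # should be near zero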