An SVM Based Voting Algorithm
with Application to Parse Reranking
Libin Shen and Aravind K. Joshi
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104, USA
{libin,joshi}@linc.cis.upenn.edu
Abstract
This paper introduces a novel Support Vec-
tor Machines (SVMs) based voting algorithm
for reranking, which provides a way to solve
the sequential models indirectly. We have
presented a risk formulation under the PAC
framework for this voting algorithm. We have
applied this algorithm to the parse reranking
problem, and achieved labeled recall and pre-
cision of 89.4%/89.8% on WSJ section 23 of
Penn Treebank.
1 Introduction
Support Vector Machines (SVMs) have been successfully
used in many machine learning tasks. Unlike the error-
driven algorithms, SVMs search for the hyperplane that
separates a set of training samples that contain two dis-
tinct classes and maximizes the margin between these
two classes. The ability to maximize the margin is be-
lieved to be the reason for SVMs’ superiority over other
classifiers. In addition, SVMs can achieve high perfor-
mance even with input data of high dimensional feature
space, especially because of the use of the ”kernel trick”.
However, the incorporation of SVMs into sequential
models remains a problem. An obvious reason is that
the output of an SVM is the distance to the separating
hyperplane, but not a probability. A possible solution
to this problem is to map SVMs’ results into probabili-
ties through a Sigmoid function, and use Viterbi search
to combine those probabilities (Platt, 1999). However,
this approach conflicts with SVMs’ purpose of achieving
the so-called global optimization1. First, this approach
may constrain SVMs to local features because of the left-
to-right scanning strategy. Furthermore, like other non-
generative Markov models, it suffers from the so-called
1By global we mean the use of quadratic optimization in
margin maximization.
label bias problem, which means that the transitions leav-
ing a given state compete only against each other, rather
than against all other transitions in the model (Lafferty et
al., 2001). Intuitively, it is the local normalization that
results in the label bias problem.
One way of using discriminative machine learning al-
gorithms in sequential models is to rerank the n-best out-
puts of a generative system. Reranking uses global fea-
tures as well as local features, and does not make lo-
cal normalization. If the output set is large enough, the
reranking approach may help to alleviate the impact of
the label bias problem, because the victim parses (i.e.
those parses which get penalized due to the label bias
problem) will have a chance to take part in the rerank-
ing.
In recent years, reranking techniques have been suc-
cessfully applied to the so-called history-based models
(Black et al., 1993), especially to parsing (Collins, 2000;
Collins and Duffy, 2002). In a history-based model, the
current decision depends on the decisions made previ-
ously. Therefore, we may regard parsing as a special form
of sequential model without losing generality.
Collins (2000) has proposed two reranking algorithms
to rerank the output of an existing parser (Collins, 1999,
Model 2). One is based on Markov Random Fields, and
the other is based on a boosting approach. In (Collins and
Duffy, 2002), the use of Voted Perceptron (VP) (Freund
and Schapire, 1999) for the parse reranking problem has
been described. In that paper, the tree kernel (Collins
and Duffy, 2001) has been used to efficiently count the
number of common subtrees as described in (Bod, 1998).
In this paper we will follow the reranking approach.
We describe a novel SVM-based voting algorithm for
reranking. It provides an alternative way of using a large
margin classifier for sequential models. Instead of using
the parse tree itself as a training sample, we use a pair of
parse trees as a sample, which is analogous to the pref-
erence relation used in the context of ordinal regression
(Herbrich et al., 2000). Furthermore, we justify the al-
gorithm through a modification of the proof of the large
margin rank boundaries for ordinal regression. We then
apply this algorithm to the parse reranking problem.
1.1 A Short Introduction of SVMs
In this section, we give a short introduction of Support
Vector Machines. We follow (Vapnik, 1998)’s definition
of SVMs. For each training sample (yi,xi), yi repre-
sents its class, and xi represents its input vector defined
on a d-dimensional space. Suppose the training samples
{(y1,x1),...,(yn,xn)} (xi ∈ Rd, yi ∈ {−1,1}) can be
separated by a hyperplane H: (x • w) + b = 0, which
means
yi((xi •w) + b) ≥ 1, (1)
where w is normal to the hyperplane. To train an SVM is
equivalent to searching for the optimal separating hyper-
plane that separates the training data without error and
maximizes the margin between two classes of samples. It
can be shown that maximizing the margin is equivalent to
minimizing ||w||2.
In order to handle linearly non-separable cases, we
introduce a positive slack variable ξi for each sample
(yi,xi). Then training can be reduced to the following
Quadratic Programming (QP) problem.
Maximize:
LD(α) ≡
lsummationdisplay
i=1
αi − 12
lsummationdisplay
i,j=1
yiyjαiαj(xi •xj) (2)
subject to: 0 ≤ αi ≤ C andsummationtexti αiyi = 0,
where αi(i = 1...l) are the Lagrange multipliers, l is the
total number of training samples, and C is a weighting
parameter for mis-classification.
Since linearly non-separable samples may become sep-
arable in a high-dimensional space, SVMs employ the
”kernel trick” to implicitly separate training samples in
a high-dimensional feature space. Let Φ : Rd mapsto→ Rh
be a function that maps a d-dimensional input vector
x to an h-dimensional feature vector Φ(x). In order
to search for the optimal separating hyperplane in the
higher-dimensional feature space, we only need to sub-
stitute Φ(xi)•Φ(xj) with xi •xj in formula (2).
If there is a function K, such that K(xi,xj) = Φ(xi)•
Φ(xj), we don’t need to compute Φ(xi) explicitly. K is
called a kernel function. Thus during the training phase
we need to solve the following QP problem.
Maximize:
LD(α) ≡
lsummationdisplay
i=1
αi − 12
lsummationdisplay
i,j=1
yiyjαiαjK(xi,xj). (3)
subject to: 0 ≤ αi ≤ C, andsummationtexti αiyi = 0.
Let x be a test vector, the decision function is
f(x) = sgn(
Nssummationdisplay
j=1
αjyjK(sj,x) + b) (4)
where sj is a training vector whose corresponding La-
grange multiplier αj > 0. sj is called a support vector.
Ns is the total number of the support vectors. According
to (4), the decision function only depends on the support
vectors.
It is worth noting that not any function K can be used
as a kernel. We call function K : Rd × Rd mapsto→ R
a well-defined kernel if and only if there is a mapping
function Φ : Rd mapsto→ Rh such that, for any xi,xj ∈ Rd,
K(xi,xj) = Φ(xi)•Φ(xj). One way of testing whether
a function is a well-defined kernel is to use the Mer-
cer’s theorem (Vapnik, 1998) by utilizing the positive
semidefinteness property. However, as far as a discrete
kernel is concerned, there is a more convenient way to
show that a function is a well-defined kernel. This is
achieved by showing that a function K is a kernel by find-
ing the corresponding mapping function Φ. This method
was used in the proof of the string subsequence kernel
(Cristianini and Shawe-Tayor, 2000) and the tree kernel
(Collins and Duffy, 2001).
1.2 Large Margin Classifiers
SVMs are called large margin classifiers because they
search for the hyperplane that maximizes the margin. The
validity of the large margin method is guaranteed by the
theorems of Structural Risk Minimization (SRM) under
Probably Approximately Correct (PAC) framework2; test
error is related to training data error, number of training
samples and the capacity of the learning machine (Smola
et al., 2000).
Vapnik-Chervonenkis (VC) dimension (Vapnik, 1999),
as well as some other measures, is used to estimate the
complexity of the hypothesis space, or the capacity of the
learning machine. The drawback of VC dimension is that
it ignores the structure of the mapping from training sam-
ples to hypotheses, and concentrates solely on the range
of the possible outputs of the learning machine (Smola
et al., 2000). In this paper we will use another measure,
the so-called Fat Shattering Dimension (Shawe-Taylor et
al., 1998), which is shown to be more accurate than VC
dimension (Smola et al., 2000), to justify our voting al-
gorithm,
Let F be a family of hypothesis functions. The fat
shattering dimension of F is a function from margin ρ to
the maximum number of samples such that any subset of
2SVM’s theoretical accuracy is much lower than their actual
performance. The ability to maximize the margin is believed to
be the reason for SVMs’ superiority over other classifiers.
these samples can be classified with margin ρ by a func-
tion in F. An upper bound of the expected error is given
in Theorem 1 below (Shawe-Taylor et al., 1998). We will
use this theorem to justify the new voting algorithm.
Theorem 1 Consider a real-valued function class F
having fat-shattering function bounded above by the
function afat : R → N which is continuous from the
right. Fix θ ∈ R. If a learner correctly classifies m
independently generated examples z with h = Tθ(f) ∈
Tθ(F) such that erz(h) = 0 and ρ = min|f(xi) − θ|,
then with confidence 1 − δ the expected error of h is
bounded from above by
2
m(klog(
8em
k )log(32m) + log(
8m
δ )) (5)
where k = afat(ρ/8).
2 A New SVM-based Voting Algorithm
Let xij be the jth candidate parse for the ith sentence in
training data. Let xi1 is the parse with the highest f-score
among all the parses for the ith sentence.
We may take xi1 as positive samples, and xij(j>1) as
negative samples. However, experiments have shown that
this is not the best way to utilize SVMs in reranking (Di-
jkstra, 2001). A trick to be used here is to take a pair of
parses as a sample: for any i and j > 1, (xi1,xij) is a
positive sample, and (xij,xi1) is a negative sample.
Similar idea was employed in the early works of parse
reranking. In the boosting algorithm of (Collins, 2000),
for each sample (parse) xij, its margin is defined as
F(xi1, ¯α) − F(xij, ¯α), where F is a score function and
¯α is the parameter vector. In (Collins and Duffy, 2002),
for each offending parse, the parameter vector updating
function is in the form of w = w + h(xi1) − h(xij),
where w is the parameter vector and h returns the feature
vector of a parse. But neither of these two papers used a
pair of parses as a sample and defined functions on pairs
of parses. Furthermore, the advantage of using difference
between parses was not theoretically clarified, which we
will describe in the next section.
As far as SVMs are concerned, the use of parses or
pairs of parses both maximize the margin between xi1
and xij, but the one using a single parse as a sample
needs to satisfy some extra constraints on the selection
of decision function. However these constraints are not
necessary (see section 3.3). Therefore the use of pairs of
parses has both theoretic and practical advantages.
Now we need to define the kernel on pairs of parses.
Let (t1,t2),(v1,v2) are two pairs of parses. Let K is
any kernel function on the space of single parses. The
preference kernel PK is defined on K as follows.
PK((t1,t2),(v1,v2)) ≡ K(t1,v1)−K(t1,v2)
−K(t2,v1) +K(t2,v2) (6)
The preference kernel of this form was previously used
in the context of ordinal regression in (Herbrich et al.,
2000). Then the decision function is
f((xj,xk)) =
Nssummationdisplay
i=1
αiyiPK((si1,si2),(xj,xk)) + b
= b + (
Nssummationdisplay
i=1
αiyi(K(si1,xj)−K(si2,xj)))
−(
Nssummationdisplay
i=1
αiyi(K(si1,xk)−K(si2,xk))),
where xj and xk are two distinct parses of a sentence,
(si1,si2) is the ith support vector, and Ns is the total
number of support vectors.
As we have defined them, the training samples are
symmetric with respect to the origin in the space. There-
fore, for any hyperplane that does not pass through the
origin, we can always find a parallel hyperplane that
crosses the origin and makes the margin larger. Hence,
the outcome separating hyperplane has to pass through
the origin, which means that b = 0.
Therefore, for each test parse x, we only need to com-
pute its score as follows.
score(x) =
Nssummationdisplay
i=1
αiyi(K(si1,x)−K(si2,x)), (7)
because
f((xj,xk)) = score(xj)−score(xk). (8)
2.1 Kernels
In (6), the preference kernel PK is defined on kernel K.
K can be any possible kernel. We will show that PK is
well-defined in the next section. In this paper, we con-
sider two kernels for K, the linear kernel and the tree
kernel.
In (Collins, 2000), each parse is associated with a set
of features. Linear combination of the features is used
in the decision function. As far as SVM is concerned,
we may encode the features of each parse with a vec-
tor. Dot product is used as the kernel K. Let u and v
are two parses. The computational complexity of linear
kernel O(|fu|∗|fv|), where |fu| and |fv| are the length
of the vectors associated with parse u and v respectively.
The goodness of the linear kernel is that it runs very fast
in the test phase, because coefficients of the support vec-
tors can be combined in advance. For a test parse x, the
computational complexity of test is only O(|fx|), which
is independent with the number of the support vectors.
In (Collins and Duffy, 2002), the tree kernel Tr is used
to count the total number of common sub-trees of two
parse trees. Let u and v be two trees. Because Tr can be
computed by dynamic programming, the computational
complexity of Tr(u,v) is O(|u|∗|v|), where |u| and |v|
are the tree sizes of u and v respectively. For a test parse
x, the computational complexity of the test is O(S∗|x|),
where S is the number of support vectors.
3 Justifying the Algorithm
3.1 Justifying the Kernel
Firstly, we show that the preference kernel PK defined
above is well-defined. Suppose kernel K is defined on
T × T. So there exists Φ : T mapsto→ H, such that
K(x1,x2) = Φ(x1)•Φ(x2) for any x1,x2 ∈ T.
It suffices to show that there exist space Hprime and
mapping function Φprime : T×T → Hprime such that
PK((t1,t2),(v1,v2)) = Φprime(t1,t2) • Φprime(t1,t2), where
t1,t2,v1,v2 ∈ T.
According to the definition of PK, we have
PK((t1,t2),(v1,v2))
= K(t1,v1)− K(t1,v2)− K(t2,v1) + K(t2,v2)
= Φ(t1)•Φ(v1)−Φ(t1)•Φ(v2)
−Φ(t2)•Φ(v1) + Φ(t2)•Φ(v2)
= (Φ(t1)−Φ(t2))•(Φ(v1)−Φ(v2)), (9)
Let Hprime = H and Φprime(x1,x2) = Φ(x1)−Φ(x2). Hence
kernel PK is well-defined.
3.2 Margin Bound for SVM-based Voting
We will show that the expected error of voting is bounded
from above in the PAC framework. The approach used
here is analogous to the proof of ordinal regression (Her-
brich et al., 2000). The key idea is to show the equiva-
lence of the voting risk and the classification risk.
Let X be the set of all parse trees. For each x ∈ X,
let ¯x be the best parse for the sentence related to x. Thus
the appropriate loss function for the voting problem is as
follows.
lvote(x,f) ≡
braceleftbigg 1 if f(¯x) < f(x)
0 otherwise (10)
where f is a parse scoring function.
Let E = {(x, ¯x)|x ∈ X} ∪ {(¯x,x)|x ∈ X}. E
is the space of event of the classification problem, and
PrE((x, ¯x)) = PrE((¯x,x)) = 12PrX(x). For any parse
scoring function f, let gf(x1,x2) ≡ sgn(f(x1)−f(x2)).
For classifier gf on space E, its loss function is
lclass(x1,x2,gf) ≡





1 if x1 = ¯x2 and
gf(x1,x2) = −1
1 if x2 = ¯x1 and
gf(x1,x2) = +1
0 otherwise
Therefore the expected risk Rvote(f) for the voting
problem is equivalent to the expected risk Rclass(gf) for
the classification problem.
Rvote(f)
= Ex∈X(lvote(x,f))
= E(x,¯x)∈E(lvote(x,f)) +E(¯x,x)∈E(lvote(x,f))
= E(x1,x2)∈E(lclass(x1,x2,gf))
= Rclass(gf) (11)
However, the definition of space E violates the inde-
pendently and identically distributed (iid) assumption.
Parses for the same sentence are not independent. If we
suppose that no two pairs of parses come from the same
sentence, then the idd assumption holds. In practice, the
number of sentences is very large, i.e. more than 30000.
So we may use more than one pair of parses of the same
sentence and still assume the idd property roughly, be-
cause for any two arbitrary pairs of parses, 29999 out of
30000, these two samples are independent.
Let ρ ≡ mini=1..n,j=2..mi |f(xi1) − f(xij)| =
mini=1..n,j=2..mi |g(xi1,xij)−0|. According to (11) and
Theorem 1 in section 1.2 we get the following theorem.
Theorem 2 If gf makes no error on the training data,
with confidence 1−δ
Rvote(f) = Rclass(gf)
≤ 2m(klog(8emk )log(32m)
+log(8mδ )), (12)
where k = afat(ρ/8),m =summationtexti=1..n(mi −1).
3.3 Justifying Pairwise Samples
An obvious way to use SVM is to use each single parse,
instead of a pair of parse trees, as a training sample. Only
the best parse of each sentence is regarded as a positive
sample, and all the rest are regarded as negative samples.
Similar to the pairwise system, it also maximizes the mar-
gin between the best parse of a sentence and all incorrect
parses of this sentence. Suppose f is the function result-
ing from the SVM. It requires yijf(xij) > 0 for each
sample (xij,yij). However this constraint is not neces-
sary. We only need to guarantee that f(xi1) > f(xij).
This is the reason for using pairs of parses as training
samples instead of single parses.
We may rewrite the score function (7) as follows.
score(x) =
summationdisplay
i,j
ci,jK(si,j,x), (13)
where i is the index for sentence, j is the index for parse,
and ∀isummationtextj ci,j = 0.
The format of score in (13) is the same as the deci-
sion function generated by an SVM trained on the single
parses as samples. However, there is a constraint that
the sum of the coefficients related to parses of the same
sentence is 0. So in this way we decrease the size of hy-
pothesis space based on the prior knowledge that only the
different segments of two distinct parses determine which
parse is better.
4 Related Work
The use of pairs of parse trees in our model is analogous
to the preference relation used in the ordinal regression
algorithm (Herbrich et al., 2000). In that paper, pairs of
objects have been used as training samples. For exam-
ple, let (r1,r2,...rm) be a list of objects in the training
data, where ri ranks ith. Then pairs of objects (ri−1,ri)
are training samples. Preference kernel PK in our paper
is the same as the preference kernel in (Herbrich et al.,
2000) in format.
However, the purpose of our model is different from
that of the ordinal regression algorithm. Ordinal regres-
sion searches for a regression function for ordinal values,
while our algorithm is designed to solve a voting prob-
lem. As a result, the two algorithms differ on the def-
inition of the margin. In ordinal regression, the margin
is min|f(ri) − f(ri−1)|, where f is the regression func-
tion for ordinal values. In our algorithm, the margin is
min|score(xi1)−score(xij)|.
In (Kudo and Matsumoto, 2001), SVMs have been em-
ployed in the NP chunking task, a typical labeling prob-
lem. However, they have used a deterministic algorithm
for decoding.
In (Collins, 2000), two reranking algorithms were pro-
posed. In both of these two models, the loss functions are
computed directly on the feature space. All the features
are manually defined.
In (Collins and Duffy, 2002), the Voted Perceptron al-
gorithm was used to in parse reranking. It was shown
in (Freund and Schapire, 1999; Graepel et al., 2001) that
error bound of (voted) Perceptron is related to margins
existing in the training data, but these algorithm are not
supposed to maximize margins. Variants of the Percep-
tron algorithm, which are known as Approximate Maxi-
mal Margin classifier, such as PAM (Krauth and Mezard,
1987), ALMA (Gentile, 2001) and PAUM (Li et al.,
2002), produce decision hyperplanes within ratio of the
maximal margin. However, almost all these algorithms
are reported to be inferior to SVMs in accuracy, while
more efficient in training.
Furthermore, these variants of the Perceptron algo-
rithm take advantage of the large margin existing in the
training data. However, in NLP applications, samples are
usually inseparable even if the kernel trick is used. SVMs
can still be trained to maximize the margin through the
method of soft margin.
5 Experiments and Analysis
We use SVMlight (Joachims, 1998) as the SVM clas-
sifier. The soft margin parameter C is set to its default
value in SVMlight.
We use the same data set as described in (Collins,
2000; Collins and Duffy, 2002). Section 2-21 of the Penn
WSJ Treebank (Marcus et al., 1994) are used as train-
ing data, and section 23 is used for final test. The train-
ing data contains around 40,000 sentences, each of which
has 27 distinct parses on average. Of the 40,000 training
sentences, the first 36,000 are used to train SVMs. The
remaining 4,000 sentences are used as development data.
The training complexity for SVMlight is roughly
O(n2.1) (Joachims, 1998), where n is the number of the
training samples. One solution to the scaling difficulties
is to use the Kernel Fisher Discriminant as described in
(Salomon et al., 2002). In this paper, we divide train-
ing data into slices to speed up training. Each slice con-
tains two pairs of parses from each sentence. Specifically,
slice i contains positive samples (( ˜pk,pki),+1) and neg-
ative samples ((pki, ˜pk),−1), where ˜pk is the best parse
for sentence k, pki is the parse with the ith highest log-
likelihood in all the parses for sentence k and it is not the
best parse. There are about 60000 parses in each slice in
average. For each slice, we train an SVM. Then results
of SVMs are put together with a simple combination. It
takes about 2 days to train a slice on a P3 1.13GHz pro-
cessor.
As a result of this subdivision of the training data into
slices, we cannot take advantage of SVM’s global opti-
mization ability. This seems to nullify our effort to cre-
ate this new algorithm. However, our new algorithm is
still useful for the following reasons. Firstly, with the im-
provement in the computing resources, we will be able to
use larger slices so as to utilize more global optimization.
SVMs are superior to other linear classifiers in theory. On
the other hand, the current size of the slice is large enough
for other NLP applications like text chunking, although it
is not large enough for parse reranking. The last reason
is that we have achieved state-of-the-art results even with
the sliced data.
We have used both a linear kernel and a tree kernel.
For the linear kernel test, we have used the same dataset
as that in (Collins, 2000). In this experiment, we first train
22 SVMs on 22 distinct slices. In order to combine those
SVMs results, we have tried mapping SVMs’ results to
probabilities via a Sigmoid as described in (Platt, 1999).
We use the development data to estimate parameter A
and B in the Sigmoid
Pi(y = 1|fi) = 11 + Ae−f
iB
, (14)
Table 1: Results on section 23 of the WSJ Treebank.
LR/LP = labeled recall/precision. CBs = average number
of crossing brackets per sentence. 0 CBs, 2 CBs are the
percentage of sentences with 0 or ≤ 2 crossing brackets
respectively. CO99 = Model 2 of (Collins, 1999). CH00
= (Charniak, 2000). CO00 = (Collins, 2000).
≤40 Words (2245 sentences)
Model LR LP CBs 0 CBs 2 CBs
CO99 88.5% 88.7% 0.92 66.7% 87.1%
CH00 90.1% 90.1% 0.74 70.1% 89.6%
CO00 90.1% 90.4% 0.73 70.7% 89.6%
SVM 89.9% 90.3% 0.75 71.7% 89.4%
≤100 Words (2416 sentences)
Model LR LP CBs 0 CBs 2 CBs
CO99 88.1% 88.3% 1.06 64.0% 85.1%
CH00 89.6% 89.5% 0.88 67.6% 87.7%
CO00 89.6% 89.9% 0.87 68.3% 87.7%
SVM 89.4% 89.8% 0.89 69.2% 87.6%
where fi is the result of the ith SVM. The parse with
maximal value of producttexti Pi(y = 1|fi) is chosen as the top-
most parse. Experiments on the development data shows
that the result is better if Ae−fiB is much larger than 1.
Therefore
nproductdisplay
i=1
Pi(y = 1|fi) =
nproductdisplay
i=1
1
1 + Ae−fiB
≈
nproductdisplay
i
1
Ae−fiB
= A−ne(B
summationtextn
i=1 fi) (15)
Therefore, we may use summationtexti fi directly, and there is no
need to estimate A and B in (14). Then we combine
SVMs’ result with the log-likelihood generated by the
parser (Collins, 1999). Parameter β is used as the weight
of the log-likelihood. In addition, we find that our SVM
has greater labeled precision than labeled recall, which
means that the system prefer parses with less brackets.
So we take the number of brackets as another feature to
be considered, with weight α. α and β are estimated on
the development data.
The result is shown in Table 1. The performance of
our system matches the results of (Charniak, 2000), but
is a little lower than the results of the Boosting system
in (Collins, 2000), except that the percentage of sen-
tences with no crossing brackets is 1% higher than that of
(Collins, 2000). Since we have to divide data into slices,
we cannot take full advantage of the margin maximiza-
tion.
Figure 1 shows the learning curves. β is used to con-
Figure 1: Learning curves on the development dataset of
(Collins, 2000). X-axis stands for the number of slices to
be combined. Y-axis stands for the F-score.
0.89
0.892
0.894
0.896
0.898
0.9
0.902
0.904
0.906
0.908
0.91
0 5 10 15 20
number of slices
baselineβ
=0
a51
a51
a51a51a51a51a51a51
a51a51a51a51a51a51a51a51a51a51a51a51a51a51
a51β
=200
+
+
+++
+++++++++++++++++
+β
=500
a50
a50a50
a50a50
a50a50a50a50
a50a50a50a50a50a50a50a50
a50a50a50a50a50
a50β
=1000
××
××
××
××××
××××××
××××××
×
Table 2: Results on section 23 of the WSJ Treebank.
LR/LP = labeled recall/precision. CBs = average num-
ber of crossing brackets per sentence. CO99 = Model 2
of (Collins, 1999). CD02 = (Collins and Duffy, 2002)
≤100 Words (2416 sentences)
Model LR LP CBs
CO99 88.1% 88.3% 1.06
CD02 88.6% 88.9% 0.99
SVM(tree) 88.7% 88.8% 0.99
trol the weight of log-likelihood given by the parser. The
proper value of β depends on the size of training data.
The best result does not improve much after combining 7
slices of training data. We think this is due the limitation
of local optimization.
Our next experiment is on the tree kernel as it is used
in (Collins and Duffy, 2002). We have only trained 5
slices, since each slice takes about 2 weeks to train on
a P3 1.13GHz processor. In addition, the speed of test
for the tree kernel is much slower than that for the lin-
ear kernel. The experimental result is shown in Table 2
The results of our SVM system match the results of the
Voted Perceptron algorithm in (Collins and Duffy, 2002),
although only 5 slices, amounting to less than one fourth
of the whole training dataset, have been used.
6 Conclusions and Future Work
We have introduced a new approach for applying SVMs
to sequential models indirectly, and described a novel
SVM based voting algorithm inspired by the parse
reranking problem. We have presented a risk formula-
tion under the PAC framework for this voting algorithm,
and applied this algorithm to the parse reranking prob-
lem, and achieved LR/LP of 89.4%/89.8% on WSJ sec-
tion 23.
Experimental results show that the SVM with a linear
kernel is superior to the SVM with tree kernel in both
accuracy and speed. The SVM with tree kernel only
achieves a rather low f-score because it takes too many
unrelated features into account. The linear kernel is de-
fined on the features which are manually selected from a
large set of possible features.
As far as context-free grammars are concerned, it will
be hard to include more features into the current feature
set. If we simply use n-grams on context-free grammars,
it is very possible that we will introduce many useless fea-
tures, which may be harmful as they are in tree kernel sys-
tems. One way to include more useful features is to take
advantage of the derivation tree and the elementary trees
in Lexicalized Tree Adjoining Grammar (LTAG) (Joshi
and Schabes, 1997). The basic idea is that each elemen-
tary tree and every segment in a derivation tree is linguis-
tically meaningful.
We also plan to apply this algorithm to other sequen-
tial models, especially to the Supertagging problem. We
believe it will also be very useful to problems of POS
tagging and NP chunking. Compared to parse rerank-
ing, they have a much smaller training dataset and feature
size, which is more suitable for our SVM-based voting
problem.
Acknowledgments
We thank Michael Collins for help concerning the data
set, and Anoop Sarkar for his comments. We also thank
three anonymous reviewers for helpful comments.

References
E. Black, F. Jelinek, J. Lafferty, Magerman D. M.,
R. Mercer, and S. Roukos. 1993. Towards history-
based grammars: Using richer models for probabilistic
parsing. In Proceedings of the ACL 1993.
Rens Bod. 1998. Beyond Grammar: An Experience-
Based Theory of Language. CSLI Publica-
tions/Cambridge University Press.
E. Charniak. 2000. A maximum-entropy-inspired parser.
In Proceedings of NAACL 2000.
Michael Collins and Nigel Duffy. 2001. Convolution
kernels for natural language. In Proceedings of Neural
Information Processing Systems (NIPS 14).
Michael Collins and Nigel Duffy. 2002. New ranking al-
gorithms for parsing and tagging: Kernels over discrete
structures, and the voted perceptron. In Proceedings of
ACL 2002.
Michael Collins. 1999. Head-Driven Statistical Models
for Natural Language Parsing. Ph.D. thesis, Univer-
sity of Pennsylvania.
Michael Collins. 2000. Discriminative reranking for nat-
ural language parsing. In Proceedings of the 7th Inter-
national Conference on Machine Learning.
N. Cristianini and J. Shawe-Tayor. 2000. An introduction
to Support Vector Machines and other kernel-based
learning methods. Cambridge University Press.
Emma Dijkstra. 2001. Support vector machines for parse
selection. Master’s thesis, Univ. of Edinburgh.
Yoav Freund and Robert E. Schapire. 1999. Large mar-
gin classification using the perceptron algorithm. Ma-
chine Learning, 37(3):277–296.
Claudio Gentile. 2001. A new approximate maximal
margin classification algorithm. Journal of Machine
Learning Research, 2:213–242.
Thore Graepel, Ralf Herbrich, and Robert C. Williamson.
2001. From margin to sparsity. In Advances in Neural
Information Processing Systems 13.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer.
2000. Large margin rank boundaries for ordinal re-
gression. In Advances in Large Margin Classifiers,
pages 115–132. MIT Press.
Thorsten Joachims. 1998. Making large-scale support
vector machine learning practical. In Advances in Ker-
nel Methods: Support Vector Machine. MIT Press.
A. Joshi and Y. Schabes. 1997. Tree-adjoining gram-
mars. In G. Rozenberg and A. Salomaa, editors, Hand-
book of Formal Languages, volume 3, pages 69 – 124.
Springer.
W. Krauth and M. Mezard. 1987. Learning algorithms
with optimal stability in neural networks. Journal of
Physics A, 20:745–752.
Taku Kudo and Yuji Matsumoto. 2001. Chunking with
support vector machines. In Proceedings of NAACL
2001.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Condi-
tional random fields: Probabilistic models for segmen-
tation and labeling sequence data. In Proceedings of
ICML.
Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-
Taylor, and Jaz Kandola. 2002. The perceptron al-
gorithm with uneven margins. In Proceedings of the
International Conference of Machine Learning.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1994. Building a large annotated cor-
pus of English: The Penn Treebank. Computational
Linguistics, 19(2):313–330.
John Platt. 1999. Probabilistic outputs for support vector
machines and comparisons to regularized likelihood
methods. In Advances in Large Margin Classifiers.
MIT Press.
Jesper Salomon, Simon King, and Miles Osborne. 2002.
Framewise phone classification using support vector
machines. In Proceedings of ICSLP 2002.
John Shawe-Taylor, Peter L. Bartlett, Robert C.
Williamson, and Martin Anthony. 1998. Structural
risk minimization over data-dependent hierarchies.
IEEE Trans. on Information Theory, 44(5):1926–1940.
A.J. Smola, P. Bartlett, B. Sch¨olkopf, and C. Schuurmans.
2000. Introduction to large margin classifiers. In A.J.
Smola, P. Bartlett, B. Sch¨olkopf, and C. Schuurmans,
editors, Advances in Large Margin Classifiers, pages
1–26. MIT Press.
Vladimir N. Vapnik. 1998. Statistical Learning Theory.
John Wiley and Sons, Inc.
Vladimir N. Vapnik. 1999. The Nature of Statistical
Learning Theory. Springer, 2 edition.
