Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 129–137,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Incremental Integer Linear Programming for Non-projective Dependency
Parsing
Sebastian Riedel and James Clarke
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, UK
s.r.riedel@sms.ed.ac.uk, jclarke@ed.ac.uk
Abstract
Integer Linear Programming has recently
been used for decoding in a number of
probabilistic models in order to enforce
global constraints. However, in certain ap-
plications, such as non-projective depen-
dency parsing and machine translation,
the complete formulation of the decod-
ing problem as an integer linear program
renders solving intractable. We present an
approach which solves the problem incrementally,
thereby avoiding the creation of intractable
integer linear programs. This approach is applied
to Dutch dependency parsing and we show how the
addition of linguistically motivated constraints can
yield a significant improvement over the
state-of-the-art.
1 Introduction
Many inference algorithms require models to
make strong assumptions of conditional indepen-
dence between variables. For example, the Viterbi
algorithm used for decoding in conditional random
fields requires the model to be Markovian.
Strong assumptions are also made in the case of
McDonald et al.’s (2005b) non-projective depen-
dency parsing model. Here attachment decisions
are made independently of one another¹. However,
such assumptions often cannot be justified. For
example in dependency parsing, if a subject has
already been identified for a given verb, then the
probability of attaching a second subject to the
verb is zero. Similarly, if we find that one
coordination argument is a noun, then the other
argument cannot be a verb. Thus decisions are often
co-dependent.
¹ If we ignore the constraint that dependency trees must be
cycle-free (see Sections 2 and 3 for details).
Integer Linear Programming (ILP) has recently
been applied to inference in sequential conditional
random fields (Roth and Yih, 2004); this
has allowed the use of truly global constraints
during inference. However, it is not possible to
use this approach directly for a complex task like
non-projective dependency parsing due to the ex-
ponential number of constraints required to pre-
vent cycles occurring in the dependency graph.
To model all these constraints explicitly would result
in an ILP formulation too large to solve efficiently
(Williams, 2002). A similar problem also
occurs in an ILP formulation for machine transla-
tion which treats decoding as the Travelling Sales-
man Problem (Germann et al., 2001).
In this paper we present a method which extends
the applicability of ILP to a more complex set of
problems. Instead of adding all the constraints we
wish to capture to the formulation, we first solve
the program with a fraction of the constraints. The
solution is then examined and, if required, additional
constraints are added. This procedure is repeated
until all constraints are satisfied. We apply
this dependency parsing approach to Dutch due
to the language’s non-projective nature, and take
the parser of McDonald et al. (2005b) as a starting
point for our model.
In the following section we introduce depen-
dency parsing and review previous work. In Sec-
tion 3 we present our model and formulate it as
an ILP problem with a set of linguistically mo-
tivated constraints. We include details of an in-
cremental algorithm used to solve this formula-
tion. Our experimental set-up is provided in Sec-
tion 4 and is followed by results in Section 5 along
with runtime experiments. We finally discuss future
research and potential improvements to our
approach.
Figure 1: A Dutch dependency tree for “I’ll come
at twelve and then you’ll get what you deserve”.
2 Dependency Parsing
Dependency parsing is the task of attaching words
to their arguments. Figure 1 shows a dependency
graph for the Dutch sentence “I’ll come at twelve
and then you’ll get what you deserve” (taken from
the Alpino Corpus (van der Beek et al., 2002)). In
this dependency graph the verb “kom” is attached
to its subject “ik”. “kom” is referred to as the head
of the dependency and “ik” as its child. In labelled
dependency parsing edges between words are
labelled with the relation captured. In the case of
the dependency between “kom” and “ik” the label
would be “subject”.
In a dependency tree every token must be the
child of exactly one other node, either another
token or the dummy root node. By definition, a
dependency tree is free of cycles. For example, it
must not contain dependency chains such as “en”
→ “kom” → “ik” → “en”. For a more formal
definition see previous work (Nivre et al., 2004).
An important distinction between dependency
trees is whether they are projective or non-
projective. Figure 1 is an example of a projective
dependency tree; in such trees dependencies
do not cross. In Dutch and other flexible word order
languages such as German and Czech we also
encounter non-projective trees; in these cases the
trees contain crossing dependencies.
Dependency parsing is useful for applications
such as relation extraction (Culotta and Sorensen,
2004) and machine translation (Ding and Palmer,
2005). Although less informative than lexicalised
phrase structures, dependency structures still capture
most of the predicate-argument information
needed for applications. They also have the advantage
of being more efficient to learn and parse.
McDonald et al. (2005a) introduce a dependency
parsing framework which treats the task as
searching for the projective tree that maximises
the sum of local dependency scores. This framework
is efficient and has also been extended to
non-projective trees (McDonald et al., 2005b). It
provides a discriminative online learning algorithm
which, when combined with a rich feature set,
reaches state-of-the-art performance across multiple
languages.
Figure 2: An incorrect partial dependency tree.
The verb “krijg” is incorrectly coordinated with
the preposition “om”.
However, within this framework one can only
define features over single attachment decisions.
This leads to cases where basic linguistic constraints
are not satisfied (e.g. verbs with two subjects
or incompatible coordination arguments). An
example of this for Dutch is illustrated in Figure 2
which was produced by the parser of McDonald
et al. (2005b). Here the parse contains a coordi-
nation of incompatible word classes (a preposition
and a verb).
Our approach is able to include additional constraints
which forbid configurations such as those
in Figure 2. While McDonald and Pereira (2006)
address the issue of local attachment decisions by
defining scores over attachment pairs, our solution
is more general. Furthermore, it is complementary
in the sense that we could formulate their model
using ILP and then add constraints.
The method we present is not the only one that
can take global constraints into account. Deter-
ministic dependency parsing (Nivre et al., 2004;
Yamada and Matsumoto, 2003) can apply global
constraints by conditioning attachment decisions
on the intermediate parse built. However, for efficiency
a greedy search is used which may produce
sub-optimal solutions. This is not the case when
using ILP.
3 Model
Our underlying model is a modified labelled
version² of McDonald et al. (2005b):
\[
s(x, y) = \sum_{(i,j,l) \in y} s(i,j,l) = \sum_{(i,j,l) \in y} w \cdot f(i,j,l)
\]
where x is a sentence, y is a set of labelled
dependencies, f(i,j,l) is a multidimensional feature
vector representation of the edge from token
i to token j with label l and w the corresponding
weight vector. For example, a feature f_101 in f
could be:
\[
f_{101}(i,j,l) =
\begin{cases}
1 & \text{if } t(i) = \text{“en”} \wedge p(j) = V \wedge l = \text{“coordination”} \\
0 & \text{otherwise}
\end{cases}
\]
where t(i) is the word at token i and p(j) the
part-of-speech tag at token j.
² Note that this is not described in the McDonald papers
but implemented in his software.
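As a rough illustration of how an indicator feature such as f_101 enters the edge score s(i,j,l) = w · f(i,j,l), the sketch below uses a sparse dictionary representation. The feature names, the toy sentence, the tags and the weight values are assumptions made purely for illustration; they are not taken from the trained model.

```python
from typing import Dict, List


def features(words: List[str], tags: List[str], i: int, j: int, label: str) -> Dict[str, float]:
    """Sparse feature vector f(i,j,l) for an edge from head i to child j."""
    f: Dict[str, float] = {}
    # Analogue of f_101: head word is "en", child tag is V, label is coordination.
    if words[i] == "en" and tags[j] == "V" and label == "coordination":
        f["f101"] = 1.0
    # A second, hypothetical indicator on the head/child tag pair and the label.
    f[f"tags:{tags[i]}>{tags[j]}:{label}"] = 1.0
    return f


def edge_score(w: Dict[str, float], f: Dict[str, float]) -> float:
    """s(i,j,l) = w . f(i,j,l); weights missing from w count as zero."""
    return sum(w.get(name, 0.0) * value for name, value in f.items())


# Toy sentence with an artificial root at index 0 (illustrative tags).
words = ["<root>", "ik", "kom", "en", "krijg"]
tags = ["ROOT", "Pron", "V", "Conj", "V"]
w = {"f101": 1.5, "tags:Conj>V:coordination": 0.5}  # made-up weights

score = edge_score(w, features(words, tags, 3, 4, "coordination"))
```

Here both active features of the edge from “en” to “krijg” have a weight, so their weights are simply added; an edge whose features have no weight scores zero.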
Decoding in this model amounts to finding the
y for a given x that maximises s(x,y):
\[
y' = \arg\max_{y} s(x, y)
\]
while fulfilling the following constraints:
T1 For every non-root token in x there exists ex-
actly one head; the root token has no head.
T2 There are no cycles.
Thus far, the formulation follows McDonald
et al. (2005b) and corresponds to the Maximum
Spanning Tree (MST) problem. In addition to T1
and T2, we include a set of linguistically moti-
vated constraints:
A1 Heads are not allowed to have more than one
outgoing edge labelled l for all l in a set of
labels U.
C1 In a symmetric coordination there is exactly
one argument to the right of the conjunction
and at least one argument to the left.
C2 In an asymmetric coordination there are no ar-
guments to the left of the conjunction and at
least two arguments to the right.
C3 There must be at least one comma between
subsequent arguments to the left of a sym-
metric coordination.
C4 Arguments of a coordination must have com-
patible word classes.
P1 Two dependencies must not cross if one of
their labels is in a set of labels P.
A1 covers constraints such as “there can only
be one subject” if U contains “subject” (see
Section 4.4 for more details of U). C1 applies to
configurations which contain conjunctions such as
“en”, “of” or “maar” (“and”, “or” and “but”). C2
rules out settings where a conjunction such as
“zowel” (which translates as “both”) has arguments
to its left. C3 forces coordination arguments to
the left of a conjunction to have commas in
between. C4 avoids parses in which incompatible
word classes are coordinated, such as nouns and
verbs. Finally, P1 allows selective projective
parsing: we can, for instance, forbid the crossing of
“Noun-Determiner” dependencies if we add the
corresponding label type to P (see Section 4.4 for
more details of P). If we extend P to contain all
labels we forbid any type of crossing dependencies.
This corresponds to projective parsing.
3.1 Decoding
McDonald et al. (2005b) use the Chu-Liu-
Edmonds (CLE) algorithm to solve the maxi-
mum spanning tree problem. However, global con-
straints cannot be incorporated into the CLE algo-
rithm (McDonald et al., 2005b). We alleviate this
problem by presenting an equivalent Integer Lin-
ear Programming formulation which allows us to
incorporate global constraints naturally.
Before giving full details of our formulation
we first introduce some of the concepts of linear
and integer linear programming (for a more
thorough introduction see Winston and Venkatara-
manan (2003)).
Linear Programming (LP) is a tool for solving
optimisation problems in which the aim is to max-
imise (or minimise) a given linear function with
respect to a set of linear constraints. The func-
tion to be maximised (or minimised) is referred
to as the objective function. A number of decision
variables are under our control which exert influence
on the objective function. Specifically, they
have to be optimised in order to maximise (or minimise)
the objective function. Finally, a set of constraints
restricts the values that the decision variables
can take. Integer Linear Programming is an
extension of linear programming where all decision
variables must take integer values.
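In generic notation (the symbols z, c, A and b are not used elsewhere in this paper and are introduced here only for illustration), an integer linear program over n decision variables z can be written as:
\[
\max_{z} \; c^{\top} z \quad \text{subject to} \quad A z \le b, \quad z \in \mathbb{Z}^{n}
\]
Here c holds the coefficients of the objective function and each row of A z ≤ b is one linear constraint. Dropping the requirement z ∈ Z^n gives the linear programming relaxation of the problem.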
There are several explicit formulations of the
MST problem as an integer linear program in the
literature (Williams, 2002). They are based on
the concept of eliminating subtours (cycles), cuts
(disconnections) or requiring intervertex flows
(paths). However, in practice these formulations
cause long solve times, as the first two methods
yield an exponential number of constraints.
Although the latter scales cubically, it produces
fractional solutions in its relaxed version; this
causes long runtimes for the branch and bound
algorithm (Williams, 2002) commonly used in integer
linear programming. We found experimentally
that dependency parsing models of this form
do not converge on a solution after multiple hours
of solving, even for small sentences.
Algorithm 1 Incremental Integer Linear Programming
  C ← B_x
  repeat
    y ← solve(C, O_x, V_x)
    W ← violated(y, I_x)
    C ← C ∪ W
  until W = ∅
  return y
As a workaround for this problem we follow an
incremental approach akin to the work of Warme
(1998). Instead of adding constraints which forbid
all possible cycles in advance (this would result
in an exponential number of constraints) we first
solve the problem without any cycle constraints.
The solution is then examined for cycles, and if
cycles are found we add constraints to forbid these
cycles; the solver is then run again. This process
is repeated until no more violated constraints are
found. The same procedure is used for other types
of constraints which are too expensive to add in
advance (e.g. the constraints of P1).
Algorithm 1 outlines our approach. For a sentence
x, B_x is the set of constraints that we add
in advance and I_x are the constraints we add
incrementally. O_x is the objective function and V_x
is a set of variables including integer declarations.
solve(C, O, V) maximises the objective function
O with respect to the set of constraints C and variables
V. violated(y, I) inspects the proposed solution
(y) and returns all constraints in I which are
violated.
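The loop of Algorithm 1 can be sketched in a few lines. The version below is a toy under stated assumptions, not the implementation used in our experiments: the ILP solver is replaced by a brute-force enumeration over head assignments (so T1 is built into solve), the only incremental constraints are cycle constraints, and the edge scores are invented.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[int, int]          # (head, child)
Constraint = FrozenSet[Edge]    # "these edges must not all be active at once"


def solve(constraints: Set[Constraint], score: Dict[Edge, float], n: int) -> List[int]:
    """Brute-force stand-in for an ILP solver; heads[j-1] is the head of token j."""
    best, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        edges = {(h, j) for j, h in enumerate(heads, start=1)}
        if any(h == j for h, j in edges):           # no self-attachment
            continue
        if any(c <= edges for c in constraints):    # a forbidden edge set is fully active
            continue
        total = sum(score[e] for e in edges)
        if total > best:
            best, best_heads = total, list(heads)
    return best_heads


def violated(heads: List[int]) -> Set[Constraint]:
    """Return one cycle constraint for every cycle in the proposed solution."""
    found: Set[Constraint] = set()
    for start in range(1, len(heads) + 1):
        path, node = [], start
        while node != 0 and node not in path:
            path.append(node)
            node = heads[node - 1]
        if node != 0:                               # walked back into the path: a cycle
            cycle = path[path.index(node):]
            found.add(frozenset((heads[c - 1], c) for c in cycle))
    return found


def incremental_ilp(score: Dict[Edge, float], n: int) -> Tuple[List[int], int]:
    constraints: Set[Constraint] = set()            # C <- B_x (T1 is built into solve)
    iterations = 0
    while True:
        heads = solve(constraints, score, n)        # y <- solve(C, O_x, V_x)
        iterations += 1
        new = violated(heads)                       # W <- violated(y, I_x)
        if not new:                                 # until W is empty
            return heads, iterations
        constraints |= new                          # C <- C ∪ W


# Invented scores for a three-token sentence; tokens 1 and 2 prefer each other.
n = 3
score = {(h, j): 0.0 for h in range(n + 1) for j in range(1, n + 1) if h != j}
score.update({(2, 1): 5.0, (1, 2): 5.0, (0, 1): 4.0, (0, 2): 1.0, (1, 3): 3.0})
heads, iterations = incremental_ilp(score, n)
```

In this toy run the unconstrained optimum contains the cycle between tokens 1 and 2; a single added cycle constraint is enough, so the loop stops after its second iteration.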
The number of iterations required in this approach
is at most polynomial with respect to the
number of variables (Grötschel et al., 1981). In
practice, this technique converges quickly (fewer
than 20 iterations in 99% of approximately 12,000
sentences), yielding average solve times of less
than 0.5 seconds.
Our approach converges quickly due to the
quality of the scoring function. Its weights have
been trained on cycle-free data, thus it is more
likely to guide the search to a cycle-free solution.
In the following section we present the objective
function O_x, variables V_x and linear constraints
B_x and I_x needed for parsing x using
Algorithm 1.
3.1.1 Variables
V_x contains a set of binary variables to represent
labelled edges:
\[
e_{i,j,l} \quad \forall i \in 0..n,\; j \in 1..n,\; l \in \mathrm{best}_k(i,j)
\]
where n is the number of tokens and the index 0
represents the root token. best_k(i,j) is the set of k
labels with highest s(i,j,l). e_{i,j,l} equals 1 if there
is an edge (i.e., a dependency) with the label l
between token i (head) and j (child), 0 otherwise. k
depends on the type of constraints we want to add.
For the plain MST problem it is sufficient to set
k = 1 and only take the best scoring label for each
token pair. However, if we want a constraint which
forbids duplicate subjects we need to provide
additional labels to fall back on.
V_x also contains a set of binary auxiliary variables:
\[
d_{i,j} \quad \forall i \in 0..n,\; j \in 1..n
\]
which represent the existence of a dependency
between tokens i and j. We connect these to the e_{i,j,l}
variables by the constraint:
\[
d_{i,j} = \sum_{l \in \mathrm{best}_k(i,j)} e_{i,j,l}
\]
3.1.2 Objective Function
Given the above variables our objective function
O_x can be represented as (using a suitable k):
\[
\sum_{i,j} \sum_{l \in \mathrm{best}_k(i,j)} s(i,j,l) \cdot e_{i,j,l}
\]
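To make the role of k concrete, the following sketch builds the coefficients of the objective function from a table of label scores. It assumes a plain dictionary layout for the scores, and the score values are invented; the helper names best_k and build_objective are illustrative and not part of any solver API.

```python
from typing import Dict, List, Tuple


def best_k(label_scores: Dict[str, float], k: int) -> List[str]:
    """The k labels with the highest s(i,j,l) for one token pair (i, j)."""
    return sorted(label_scores, key=label_scores.get, reverse=True)[:k]


def build_objective(
    s: Dict[Tuple[int, int], Dict[str, float]], k: int
) -> Dict[Tuple[int, int, str], float]:
    """Map each variable e_{i,j,l}, keyed by (i, j, l), to its coefficient s(i,j,l)."""
    objective = {}
    for (i, j), label_scores in s.items():
        for label in best_k(label_scores, k):
            objective[(i, j, label)] = label_scores[label]
    return objective


# Hypothetical scores for two candidate heads of token 2.
s = {
    (0, 2): {"root": 2.0, "su": 0.1, "obj1": -1.0},
    (1, 2): {"su": 1.2, "obj1": 0.9, "root": -3.0},
}
objective_k1 = build_objective(s, k=1)   # plain MST: one label per token pair
objective_k2 = build_objective(s, k=2)   # leaves a fallback label for A1
```

With k = 1 only the best label per token pair survives; with k = 2 the edge from token 1 to token 2 can fall back on “obj1” if a uniqueness constraint rules out a second “su”.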
3.1.3 Base Constraints
We first introduce a set of base constraints B_x
which we add in advance.
Only One Head (T1) Every token has exactly
one head:
\[
\sum_{i} d_{i,j} = 1
\]
for non-root tokens j > 0 in x. An exception is
made for the artificial root node:
\[
\sum_{i} d_{i,0} = 0
\]
Label Uniqueness (A1) To enforce uniqueness
of children with labels in U we augment our model
with the constraint:
\[
\sum_{j} e_{i,j,l} \le 1
\]
for every token i in x and label l in U.
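What A1 forbids can be illustrated by checking a finished parse for violations. The sketch below is only a checker over a list of (head, child, label) triples with made-up edges; in the ILP itself the constraint is stated in advance and never needs to be detected.

```python
from collections import Counter
from typing import List, Set, Tuple


def a1_violations(edges: List[Tuple[int, int, str]], U: Set[str]) -> List[Tuple[int, str]]:
    """Return every (head, label) pair with label in U that occurs more than once."""
    counts = Counter((head, label) for head, _, label in edges if label in U)
    return sorted(key for key, count in counts.items() if count > 1)


# Token 2 has two "su" children (forbidden) and two "mod" children (allowed).
edges = [(2, 1, "su"), (2, 3, "su"), (2, 4, "mod"), (2, 5, "mod")]
violations = a1_violations(edges, {"su", "obj1"})
```

Only the duplicated “su” is reported, because “mod” is not in the set U used for this example.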
Symmetric Coordination (C1) For each conjunction
token i which forms a symmetric coordination
we add:
\[
\sum_{j < i} d_{i,j} \ge 1 \quad \text{and} \quad \sum_{j > i} d_{i,j} = 1
\]
Asymmetric Coordination (C2) For each conjunction
token i which forms an asymmetric coordination
we add:
\[
\sum_{j < i} d_{i,j} = 0 \quad \text{and} \quad \sum_{j > i} d_{i,j} \ge 2
\]
3.1.4 Incremental Constraints
Now we present the set of constraints I_x we add
incrementally. The constraints are chosen based on
two criteria: (1) adding them to the base constraints
(those added in advance) would result in
an extremely large program, and (2) it must be
efficient to detect whether the constraint is violated
in y.
No Cycles (T2) For every possible cycle c for
the sentence x we have a constraint which forbids
the case where all edges in c are active simultaneously:
\[
\sum_{(i,j) \in c} d_{i,j} \le |c| - 1
\]
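Detecting a violated cycle constraint is cheap once every token has a single head. The sketch below follows head links from each token and, if it runs into a cycle, returns the coefficients and right-hand side of the corresponding constraint. The dictionary layout for the parse and the example head assignment are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (head, child)


def find_cycle(heads: Dict[int, int]) -> List[Edge]:
    """Return the edges of one cycle, or [] if the parse is cycle-free.

    heads maps every non-root token to its head; 0 denotes the root.
    """
    for start in heads:
        seen, node = [], start
        while node != 0 and node not in seen:
            seen.append(node)
            node = heads[node]
        if node != 0:                       # came back to a visited token
            cycle = seen[seen.index(node):]
            return [(heads[c], c) for c in cycle]
    return []


def cycle_constraint(cycle: List[Edge]) -> Tuple[Dict[Edge, int], int]:
    """Coefficients on d_{i,j} and right-hand side of: sum d_{i,j} <= |c| - 1."""
    return {edge: 1 for edge in cycle}, len(cycle) - 1


heads = {1: 0, 2: 4, 3: 2, 4: 3}            # 2 -> 3 -> 4 -> 2 is a cycle
cycle = find_cycle(heads)
coefficients, rhs = cycle_constraint(cycle)
```

For the three-edge cycle in this example the resulting constraint allows at most two of the three d variables to be active, which breaks the cycle while leaving each individual edge available.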
Comma Coordination (C3) For each conjunction
token i which forms a symmetric
coordination and each set of tokens A in x to the
left of i with no comma between each pair of successive
tokens we add:
\[
\sum_{a \in A} d_{i,a} \le |A| - 1
\]
which forbids configurations where i has the argument
tokens A.
Compatible Coordination Arguments (C4)
For each conjunction token i and each set of tokens
A in x with incompatible POS tags, we add a
constraint to forbid configurations where i has the
argument tokens A:
\[
\sum_{a \in A} d_{i,a} \le |A| - 1
\]
Selective Projective Parsing (P1) For each pair
of triplets (i,j,l_1) and (m,n,l_2) whose edges cross
we add the constraint:
\[
e_{i,j,l_1} + e_{m,n,l_2} \le 1
\]
if l_1 or l_2 is in P.
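A possible way to detect violated P1 constraints in a proposed solution is sketched below. It assumes that two edges cross when their endpoints strictly interleave in sentence order, with the root at position 0; this definition of crossing, the helper names and the example edges are our illustrative assumptions.

```python
from typing import List, Set, Tuple

LabelledEdge = Tuple[int, int, str]  # (head, child, label)


def crosses(edge_a: Tuple[int, int], edge_b: Tuple[int, int]) -> bool:
    """True if the two edges strictly interleave in sentence order."""
    (a1, a2), (b1, b2) = sorted(edge_a), sorted(edge_b)
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2


def p1_violations(edges: List[LabelledEdge], P: Set[str]) -> List[Tuple[LabelledEdge, LabelledEdge]]:
    """Pairs of crossing edges where at least one label is in P.

    Each returned pair corresponds to one constraint e_{i,j,l1} + e_{m,n,l2} <= 1.
    """
    found = []
    for x in range(len(edges)):
        for y in range(x + 1, len(edges)):
            (i, j, l1), (m, n, l2) = edges[x], edges[y]
            if (l1 in P or l2 in P) and crosses((i, j), (m, n)):
                found.append((edges[x], edges[y]))
    return found


edges = [(2, 4, "obj1"), (3, 1, "det"), (4, 5, "mod")]
violations = p1_violations(edges, {"cnj", "det"})
```

The check is quadratic in the number of edges of one parse, which is far cheaper than stating a constraint for every pair of candidate edges in advance.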
3.2 Training
For training we use single-best MIRA (McDon-
ald et al., 2005a). This is an online algorithm that
learns by parsing each sentence and comparing
the result with a gold standard. Training is complete
after multiple passes through the whole corpus.
Thus, during training, we decode using the Chu-Liu-Edmonds
(CLE) algorithm due to its speed advantage over
ILP (see Section 5.2 for a detailed comparison of
runtimes).
The fact that we decode differently during train-
ing (using CLE) and testing (using ILP) may de-
grade performance. In the presence of additional
constraints weights may be able to capture other
aspects of the data.
4 Experimental Set-up
Our experiments were designed to answer the fol-
lowing questions:
1. How much do our additional constraints help
improve accuracy?
2. How fast is our generic inference method in
comparison with the Chu-Liu-Edmonds algo-
rithm?
3. Can approximations be used to increase the
speed of our method while remaining accu-
rate?
Before we try to answer these questions we briefly
describe our data, features used, settings for U and
P in our parametric constraints, our working environment
and our implementation.
4.1 Data
We use the Alpino treebank (van der Beek et al.,
2002), taken from the CoNLL shared task of multilingual
dependency parsing³. The CoNLL data
differs slightly from the original Alpino treebank
as the corpus has been part-of-speech tagged using
a Memory-Based-Tagger (Daelemans et al., 1996).
It consists of 13,300 sentences with an average
length of 14.6 tokens. The data is non-projective;
more specifically, 5.4% of all dependencies are
crossed by at least one other dependency. It contains
approximately 6000 sentences more than the
Alpino corpus used by Malouf and van Noord
(2004).
The training set was divided into a 10% development
set (dev), while the remaining 90% was used
as a training and cross-validation set (cross). Feature
sets, constraints and training parameters were
selected through training on cross and optimising
against dev.
Our final accuracy scores and runtime evaluations
were acquired using nine-fold cross-validation
on cross.
4.2 Environment and Implementation
All our experiments were conducted on an Intel
Xeon with 3.8 GHz and 4 GB of RAM. We used
the open source Mixed Integer Programming library
lp_solve⁴ to solve the Integer Linear Programs.
Our code ran in Java and called the JNI
wrapper around the lp_solve library.
4.3 Feature Sets
Our feature set was determined through experi-
mentation with the development set. The features
are based upon the data provided within the Alpino
treebank. Along with POS tags the corpus contains
several additional attributes such as gender, num-
ber and case.
Our best results on the development set were
achieved using the feature set of McDonald et al.
(2005a) and a set of features based on the addi-
tional attributes. These features combine the at-
tributes of the head with those of the child. For
example, if token i has the attributes a1 and a2,
and token j has the attribute a3 then we created
the features (a1∧a3) and (a2∧a3).
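The attribute conjunction described above amounts to a cross product of head and child attributes. A minimal sketch, assuming attributes are given as plain strings and using an arbitrary “&” separator for the feature names:

```python
from itertools import product
from typing import List


def attribute_pair_features(head_attrs: List[str], child_attrs: List[str]) -> List[str]:
    """One conjunction feature for every (head attribute, child attribute) pair."""
    return [f"{h}&{c}" for h, c in product(head_attrs, child_attrs)]


# Token i (head) has attributes a1 and a2; token j (child) has attribute a3.
pairs = attribute_pair_features(["a1", "a2"], ["a3"])
```

This yields the two features (a1∧a3) and (a2∧a3) from the example in the text.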
³ For details see http://nextens.uvt.nl/~conll.
⁴ The software is available from http://www.geocities.com/lpsolve.
4.4 Constraints
All the constraints presented in Section 3 were
used in our model. The set U of labels for the
uniqueness constraint (A1) contained su, obj1, obj2, sup, ld, vc,
predc, predm, pc, pobj1, obcomp and body. Here
su stands for subject and obj1 for direct object (for
full details see Moortgat et al. (2000)).
The set of projective labels P contained cnj,
for coordination dependencies; and det, for de-
terminer dependencies. One exception was added
for the coordination constraint: dependencies can
cross when coordinated arguments are verbs.
One drawback of hard deterministic constraints
is the undesirable effect noisy data can cause. We
see this most prominently with coordination argument
compatibility. Words ending in “en” are typically
wrongly tagged and cause our coordination
argument constraint to discard correct coordinations.
As a workaround we assigned words ending
in “en” a wildcard POS tag which is compatible
with all POS tags.
5 Results
In this section we report our results. We not only
present our accuracy but also provide an empiri-
cal evaluation of the runtime behaviour of this ap-
proach and show how parsing can be accelerated
using a simple approximation.
5.1 Accuracy
An important question to answer when using
global constraints is: How much of a performance
boost is gained when using global constraints?
We ran the system without any linguistic con-
straints as a baseline (bl) and compared it to a
system with the additional constraints (cnstr). To
evaluate our systems we use the accuracy over labelled
attachment decisions:
\[
\mathrm{LAC} = \frac{N_l}{N_t}
\]
where N_l is the number of tokens with correct
head and label and N_t is the total number of tokens.
For completeness we also report the unlabelled
accuracy:
\[
\mathrm{UAC} = \frac{N_u}{N_t}
\]
where N_u is the number of tokens with correct
head.
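Both measures are straightforward to compute from per-token (head, label) pairs. The sketch below assumes this list layout; the gold and predicted analyses are invented for illustration.

```python
from typing import List, Tuple

Attachment = Tuple[int, str]  # (head index, label) for one token


def accuracy(gold: List[Attachment], predicted: List[Attachment]) -> Tuple[float, float]:
    """Return (LAC, UAC) = (N_l / N_t, N_u / N_t) for aligned token lists."""
    n_t = len(gold)
    n_u = sum(g[0] == p[0] for g, p in zip(gold, predicted))  # correct head
    n_l = sum(g == p for g, p in zip(gold, predicted))        # correct head and label
    return n_l / n_t, n_u / n_t


gold = [(2, "su"), (0, "root"), (2, "obj1"), (3, "det")]
pred = [(2, "su"), (0, "root"), (2, "su"), (2, "det")]
lac, uac = accuracy(gold, pred)
```

In this example the third token has the right head but the wrong label, so it counts towards UAC only, and the fourth token has the wrong head and counts towards neither.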
134
        LAC     UAC     LC      UC
bl      84.6%   88.9%   27.7%   42.2%
cnstr   85.1%   89.4%   29.7%   43.8%
Table 1: Labelled (LAC) and unlabelled (UAC) accuracy
using nine-fold cross-validation on cross
for baseline (bl) and constraint-based (cnstr) system.
LC and UC are the percentages of sentences
with 100% labelled and unlabelled accuracy,
respectively.
Table 1 shows our results using nine-fold cross-validation
on the cross set. The baseline system
(no additional constraints) gives a labelled accuracy
of 84.6% and an unlabelled accuracy of 88.9%.
When we add our linguistic constraints the performance
increases by 0.5 percentage points for both
labelled and unlabelled accuracy. This increase is significant
(p < 0.001) according to Dan Bikel’s parse comparison
script and using the Sign test (p < 0.001).
Now we give a little insight into how our results
compare with the rest of the community. The reported
state-of-the-art parser of Malouf and van
Noord (2004) achieves 84.4% labelled accuracy,
which is very close numerically to our baseline.
However, they use a subset of the CoNLL Alpino
treebank with a higher average number of tokens
per sentence and also evaluate control relations,
thus results are not directly comparable. We have
also run our parser on the relatively small (approximately
400 sentences) CoNLL test data. The best
performing system (McDonald et al. 2006; note:
this system is different to our baseline) achieves
79.2% labelled accuracy while our baseline system
achieves 78.6% and our constrained version
79.8%. However, a significant difference is only
observed between our baseline and our constraint-based
system.
Examining the errors produced using the dev
set highlights two key reasons why we do not see
a greater improvement using our constraint-based
system. Firstly, we cannot improve on coordinations
that include words ending with “en” due to
the workaround presented in Section 4.4. This problem
can only be solved by improving POS taggers
for Dutch or by performing POS tagging within
the dependency parsing framework.
Secondly, our system suffers from poor next
best solutions. That is, if the best solution violates
some constraints, then we find the next best solution
is typically worse than the best solution with
violated constraints. This appears to be a consequence
of inaccurate local score distributions (as
opposed to inaccurate best local scores). For example,
suppose we attach two subjects, t1 and t2,
to a verb, where t1 is the actual subject while t2
is meant to be labelled as object. If we forbid this
configuration (two subjects) and if the score of
labelling t1 object is higher than that for t2 being
labelled subject, then the next best solution will
label t1 incorrectly as object and t2 incorrectly as
subject. This is often the case, and thus results in a
drop of accuracy.
5.2 Runtime Evaluation
We now concentrate on the runtime of our method.
While we expect a longer runtime than using the
Chu-Liu-Edmonds algorithm as in previous work (McDonald
et al., 2005b), we are interested in how large
the increase is.
Table 2 shows the average solve time (ST) for
sentences with respect to the number of tokens in
each sentence for our system with constraints (cnstr)
and the Chu-Liu-Edmonds (CLE) algorithm.
Solve times do not include feature extraction,
as this is identical for all systems. For cnstr we
also show the number of sentences that could not
be parsed after two minutes, the average number
of iterations and the average duration of the first
iteration.
The results show that parsing using our generic
approach is still reasonably fast although significantly
slower than using the Chu-Liu-Edmonds algorithm.
Also, only a small number of sentences
take longer than two minutes to parse. Thus, in
practice we would not see a significant degradation
in performance if we were to fall back on the
CLE algorithm after two minutes of solving.
When we examine the average duration of the
first iteration it appears that the majority of the
solve time is spent within this iteration. This could
be used to justify using the CLE algorithm to find
an initial solution as a starting point for the ILP solver
(see Section 6).
5.3 Approximation
Despite the fact that our parser can parse all sentences
in a reasonable amount of time, it is still significantly
slower than the CLE algorithm. While
this is not crucial during decoding, it does make
discriminative online training difficult as training
requires several iterations of parsing the whole
corpus.
Tokens                      1-10     11-20    21-30    31-40    41-50    >50
Count                       5242     4037     1835     650      191      60
Avg. ST (CLE)               0.27ms   0.98ms   3.2ms    7.5ms    14ms     23ms
Avg. ST (cnstr)             5.6ms    52ms     460ms    1.5s     7.2s     33s
ST > 120s (cnstr)           0        0        0        0        3        3
Avg. # iter. (cnstr)        2.08     2.87     4.48     5.82     8.40     15.17
Avg. ST 1st iter. (cnstr)   4.2ms    37ms     180ms    540ms    1.3s     2.6s
Table 2: Runtime evaluation for different sentence lengths. Average solve time (ST) for our system
with constraints (cnstr), the Chu-Liu-Edmonds algorithm (CLE), number of sentences with solve times
greater than 120 seconds, average number of iterations and first iteration solve time.
       q=5      q=10     all      CLE
LAC    84.90%   85.11%   85.14%   85.14%
ST     351s     760s     3640s    20s
Table 3: Labelled accuracy (LAC) and total solve
time (ST) for the cross dataset using varying q values
and the Chu-Liu-Edmonds algorithm (CLE).
Thus we investigate if it is possible to speed up
our inference using a simple approximation. For
each token we now only consider the q variables
in V_x with the highest scoring edges. For example,
if we set q = 2 the set of variables for a token
j will contain two variables, either both for
the same head i but with different labels (variables
e_{i,j,l_1} and e_{i,j,l_2}) or two distinct heads i_1 and i_2
(variables e_{i_1,j,l_1} and e_{i_2,j,l_2}) where labels l_1 and
l_2 may be identical.
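The pruning step can be sketched as follows, assuming the candidate variables are stored in a dictionary keyed by (head, child, label) with their scores as values. The helper name prune_top_q and the scores are illustrative.

```python
from typing import Dict, List, Tuple

Var = Tuple[int, int, str]  # (head i, child j, label l) of a variable e_{i,j,l}


def prune_top_q(scores: Dict[Var, float], q: int) -> Dict[Var, float]:
    """Keep, for every child token j, only its q highest-scoring variables."""
    by_child: Dict[int, List[Var]] = {}
    for var in scores:
        by_child.setdefault(var[1], []).append(var)
    kept: Dict[Var, float] = {}
    for variables in by_child.values():
        for var in sorted(variables, key=scores.get, reverse=True)[:q]:
            kept[var] = scores[var]
    return kept


scores = {
    (1, 2, "su"): 2.0,
    (1, 2, "obj1"): 1.5,
    (3, 2, "su"): 0.4,
    (0, 2, "root"): -1.0,
    (2, 1, "det"): 0.7,
}
pruned = prune_top_q(scores, q=2)
```

For token 2 both surviving variables happen to share the head 1 and differ only in label, which is the first of the two cases described above.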
Table 3 shows the effect of different q values
on solve time for the full corpus cross (roughly
12,000 sentences) and overall accuracy. We see
that solve time can be reduced by 80% while only
losing a marginal amount of accuracy when we set
q to 10. However, we are unable to reach the
20-second solve time of the CLE algorithm. Despite
this, when we add the time for preprocessing and
feature extraction, the CLE system parses a corpus
in around 15 minutes whereas our system with
q = 10 takes approximately 25 minutes⁵. When
we view the total runtime of each system we see
our system is more competitive.
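As a quick check against the figures in Table 3 (a worked calculation added for clarity), the relative reduction in solve time for q = 10 is:
\[
1 - \frac{760\,\mathrm{s}}{3640\,\mathrm{s}} \approx 0.79
\]
that is, roughly 80%, while labelled accuracy falls only from 85.14% to 85.11%. For q = 5 the corresponding reduction is 1 − 351/3640 ≈ 0.90, at the cost of a larger drop in accuracy to 84.90%.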
⁵ Even when caching feature extraction during training,
McDonald et al. (2005a) still takes approximately 10 minutes
to train.
6 Discussion
While we have presented significant improvements
using additional constraints, one may wonder
whether the improvements are large enough
to justify further research in this direction; espe-
cially since McDonald and Pereira (2006) present
an approximate algorithm which also makes more
global decisions. However, we believe that our ap-
proach is complementary to their model. We can
model higher order features by using an extended
set of variables and a modified objective function.
Although this is likely to increase runtime, it may
still be fast enough for real world applications. In
addition, it will allow exact inference, even in the
case of non-projective parsing. Also, we argue that
this approach has potential for interesting exten-
sions and applications.
For example, during our runtime evaluations we
find that a large fraction of solve time is spent in
the first iteration of our incremental algorithm. After
the first iteration the solver uses its last state to
efficiently search for solutions in the presence of
new constraints. Some solvers allow the specification
of an initial solution as a starting point, thus it
is expected that significant improvements in terms
of speed can be made by using the CLE algorithm
to provide an initial solution.
Our approach uses a generic algorithm to solve
a complex task. Thus other applications may benefit
from it. For instance, Germann et al. (2001)
present an ILP formulation of the Machine Translation
(MT) decoding task in order to conduct exact
inference. However, their model suffers from
the same type of exponential blow-up we observe
when we add all our cycle constraints in advance.
In fact, the constraints which cause the exponential
explosion in their graphical formulation are of
the same nature as our cycle constraints. We hope
that the incremental approach will allow exact MT
decoding for longer sentences.
7 Conclusion
In this paper we have presented a novel ap-
proach for inference using ILP. While previous ap-
proaches which use ILP for decoding have solved
each integer linear program in one run, we incre-
mentally add constraints and solve the resulting
program until no more constraints are violated.
This allows us to efficiently use ILP for dependency
parsing and add constraints which provide
a significant improvement over the current state-of-the-art
parser (McDonald et al., 2005b) on the
Dutch Alpino corpus (see bl row in Table 1).
Although slower than the baseline approach,
our method can still parse large sentences (more
than 50 tokens) in a reasonable amount of time
(less than a minute). We have shown that parsing
time can be significantly reduced using a
simple approximation which only marginally degrades
performance. Furthermore, we believe that
the method has potential for further extensions and
applications.
Acknowledgements
Thanks to Ivan Meza-Ruiz, Ruken Çakıcı, Beata
Kouchnir and Abhishek Arun for their contribu-
tion during the CoNLL shared task and to Mirella
Lapata for helpful comments and suggestions.

References
Culotta, Aron and Jeffery Sorensen. 2004. Dependency tree
kernels for relation extraction. In 42nd Annual Meeting of
the Association for Computational Linguistics. Barcelona,
Spain, pages 423–429.
Daelemans, W., J. Zavrel, and S. Berck. 1996. MBT: A
memory-based part of speech tagger-generator. In Proceedings
of the Fourth Workshop on Very Large Corpora.
pages 14–27.
Ding, Yuan and Martha Palmer. 2005. Machine translation
using probabilistic synchronous dependency insertion
grammars. In The 43rd Annual Meeting of the Association
of Computational Linguistics. Ann Arbor, MI, USA, pages
541–548.
Germann, Ulrich, Michael Jahr, Kevin Knight, Daniel Marcu,
and Kenji Yamada. 2001. Fast decoding and optimal decoding
for machine translation. In Meeting of the Association
for Computational Linguistics. Toulouse, France,
pages 228–235.
Grötschel, M., L. Lovász, and A. Schrijver. 1981. The ellipsoid
method and its consequences in combinatorial optimization.
Combinatorica 1:169–197.
Malouf, Robert and Gertjan van Noord. 2004. Wide coverage
parsing with stochastic attribute value grammars. In
Proc. of IJCNLP-04 Workshop “Beyond Shallow Analyses”.
Sanya City, Hainan Island, China.
McDonald, R., K. Crammer, and F. Pereira. 2005a. Online
large-margin training of dependency parsers. In 43rd Annual
Meeting of the Association for Computational Linguistics.
Ann Arbor, MI, USA, pages 91–98.
McDonald, R. and F. Pereira. 2006. Online learning of approximate
dependency parsing algorithms. In 11th Conference
of the European Chapter of the Association for
Computational Linguistics. Trento, Italy, pages 81–88.
McDonald, R., F. Pereira, K. Ribarov, and J. Hajič. 2005b.
Non-projective dependency parsing using spanning tree
algorithms. In Proceedings of Human Language Technology
Conference and Conference on Empirical Methods in
Natural Language Processing. Association for Computational
Linguistics, Vancouver, British Columbia, Canada,
pages 523–530.
McDonald, Ryan, Kevin Lerman, and Fernando Pereira.
2006. Multilingual dependency analysis with a two-stage
discriminative parser. In Proceedings of CoNLL-2006.
New York, USA.
Moortgat, M., I. Schuurman, and T. van der Wouden.
2000. CGN syntactische annotatie. Internal report Corpus
Gesproken Nederlands.
Nivre, J., J. Hall, and J. Nilsson. 2004. Memory-based dependency
parsing. In Proceedings of CoNLL-2004. Boston,
MA, USA, pages 49–56.
Roth, D. and W. Yih. 2004. A linear programming formulation
for global inference in natural language tasks. In
Proceedings of CoNLL-2004. Boston, MA, USA, pages
1–8.
van der Beek, Leonoor, Gosse Bouma, Robert Malouf, and
Gertjan van Noord. 2002. The Alpino dependency treebank.
In Computational Linguistics in the Netherlands
(CLIN). Rodopi.
Warme, David Michael. 1998. Spanning Trees in Hypergraphs
with Application to Steiner Trees. Ph.D. thesis,
University of Virginia.
Williams, Justin C. 2002. A linear-size zero-one programming
model for the minimum spanning tree problem in
planar graphs. Networks 39:53–60.
Winston, Wayne L. and Munirpallam Venkataramanan.
2003. Introduction to Mathematical Programming.
Brooks/Cole.
Yamada, Hiroyasu and Yuji Matsumoto. 2003. Statistical dependency
analysis with support vector machines. In Proceedings
of IWPT. pages 195–206.
