Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X), pages 226–230, New York City, June 2006. ©2006 Association for Computational Linguistics
Multi-lingual Dependency Parsing with Incremental Integer Linear
Programming
Sebastian Riedel, Ruket Çakıcı and Ivan Meza-Ruiz
ICCS
School of Informatics
University of Edinburgh
Edinburgh, EH8 9LW, UK
S.R.Riedel,R.Cakici,I.V.Meza-Ruiz@sms.ed.ac.uk
Abstract
Our approach to dependency parsing is based on the linear model of McDonald et al. (2005b). Instead of
solving the linear model using the Max-
imum Spanning Tree algorithm we pro-
pose an incremental Integer Linear Pro-
gramming formulation of the problem that
allows us to enforce linguistic constraints.
Our results show only marginal improvements over the non-constrained parser. In addition to the fact that many parses did not violate any constraints in the first place, this can be attributed to three reasons: 1) the next best solution that fulfils the constraints yields equal or lower accuracy, 2) noisy POS tags, and 3) occasionally our inference algorithm was too slow and decoding timed out.
1 Introduction
This paper presents our submission for the CoNLL
2006 shared task of multilingual dependency pars-
ing. Our parser is inspired by McDonald et al. (2005a), which treats the task as the search for the highest scoring Maximum Spanning Tree (MST) in a graph. This framework is efficient for both projective and non-projective parsing and provides an online learning algorithm which, combined with a rich feature set, achieves state-of-the-art performance across multiple languages (McDonald and Pereira, 2006).
However, McDonald and Pereira (2006) mention
the restrictive nature of this parsing algorithm. In
their original framework, features are only defined
over single attachment decisions. This leads to cases
where basic linguistic constraints are not satisfied
(e.g. verbs with two subjects). In this paper we
present a novel way to implement the parsing al-
gorithms for projective and non-projective parsing
based on a more generic incremental Integer Linear
Programming (ILP) approach. This allows us to in-
clude additional global constraints that can be used
to impose linguistic information.
The rest of the paper is organised in the following
way. First we give an overview of the Integer Linear
Programming model and how we trained its param-
eters. We then describe our feature and constraint
sets for the 12 different languages of the task (Hajič et al., 2004; Chen et al., 2003; Böhmová et al., 2003; Kromann, 2003; van der Beek et al., 2002; Brants et al., 2002; Kawata and Bartels, 2000; Afonso et al., 2002; Džeroski et al., 2006; Civit Torruella and Martí Antonín, 2002; Nilsson et al., 2005; Oflazer et al., 2003; Atalay et al., 2003). Finally, our results are
discussed and error analyses for Chinese and Turk-
ish are presented.
2 Model
Our model is based on the linear model presented in McDonald et al. (2005a),

$$s(x,y) = \sum_{(i,j) \in y} s(i,j) = \sum_{(i,j) \in y} \mathbf{w} \cdot \mathbf{f}(i,j) \quad (1)$$

where x is a sentence, y a parse and s a score function over sentence-parse pairs. f(i,j) is a multidimensional feature vector representation of the edge from token i to token j and w the corresponding weight vector. Decoding in this model amounts to finding, for a given x, the parse

$$y' = \arg\max_y s(x,y)$$

such that y contains no cycles, attaches exactly one head to each non-root token and no head to the root node.
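To make the edge-factored scoring concrete, here is a minimal Python sketch of Equation 1, assuming sparse feature vectors represented as dicts; the names are ours, not from the paper:

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (head, child) token indices, 0 = artificial root

def edge_score(w: Dict[str, float], f: Dict[str, float]) -> float:
    """s(i, j) = w . f(i, j): dot product of weights and edge features."""
    return sum(w.get(name, 0.0) * v for name, v in f.items())

def parse_score(w: Dict[str, float],
                parse: List[Edge],
                feats: Dict[Edge, Dict[str, float]]) -> float:
    """s(x, y): sum of edge scores over all edges (i, j) in the parse y."""
    return sum(edge_score(w, feats[e]) for e in parse)
```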
2.1 Decoding
Instead of using the MST algorithm (McDonald et al., 2005b) to maximise Equation 1, we present an equivalent ILP formulation of the problem. An advantage of a general-purpose inference technique is that further linguistically motivated constraints can be added. For instance, we can add constraints that enforce that a verb cannot have more than one subject argument or that coordination arguments should have compatible types. Roth and Yih (2005) are similarly motivated and use ILP to deal with additional hard constraints in a Conditional Random Field model for Semantic Role Labelling.
There are several explicit formulations of the
MST problem as integer programs in the literature
(Williams, 2002). They are based on the concept of
eliminating subtours (cycles), cuts (disconnections)
or requiring intervertex flows (paths). However, in practice these cause long solving times. While the first two types yield an exponential number of constraints, the latter scales cubically but produces non-integral solutions in its relaxed version, causing long runtimes of the branch-and-bound algorithm. In practice, solving models of this form did not converge after hours, even for small sentences.
To get around this problem we followed an incremental approach akin to that of Warme (1998). Instead of adding constraints that forbid all possible cycles in advance (this would result in an exponential number of constraints), we first solve the problem without any cycle constraints. Only if the result contains cycles do we add constraints that forbid these cycles and run the solver again. This process is repeated until no more violated constraints are found. Figure 1 shows this algorithm.
Grötschel et al. (1981) showed that such an approach will converge after a polynomial number of iterations with respect to the number of variables.
1. Solve IP P_i
2. Find violated constraints C in the solution of P_i
3. If C = ∅ we are done
4. P_{i+1} = P_i ∪ C
5. i = i + 1
6. goto (1)

Figure 1: Incremental Integer Linear Programming
In practice, this technique showed fast convergence
(less than 10 iterations) in most cases, yielding solv-
ing times of less than 0.5 seconds. However, for
some sentences in certain languages, such as Chi-
nese or Swedish, an optimal solution could not be
found after 500 iterations.
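A schematic Python rendering of the loop in Figure 1; here `solve`, `find_violated` and `add_constraints` are placeholders for the ILP solver and the constraint checks described below, not the authors' actual code, and the iteration cap mirrors the 500-iteration cut-off mentioned above:

```python
def incremental_solve(problem, solve, find_violated, max_iterations=500):
    """Cutting-plane loop of Figure 1: solve the IP, look for violated
    constraints (e.g. cycles), add them, and re-solve until none remain."""
    for _ in range(max_iterations):
        solution = solve(problem)            # 1. solve IP P_i
        violated = find_violated(solution)   # 2. violated constraints C
        if not violated:                     # 3. C is empty: done
            return solution
        problem.add_constraints(violated)    # 4. P_{i+1} = P_i with C added
    return None                              # decoding timed out
```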
In the following section we present the objective function, variables and linear constraints that make up the Integer Linear Program.
2.1.1 Variables
In the implementation of McDonald et al. (2005b), dependency labels are handled by finding the best scoring label for a given token pair, so that

$$s(i,j) = \max_{label} s(i,j,label)$$

goes into Equation 1 (note, however, that labelled parsing is not described in that publication). This is only exact as long as no further constraints are added. Since our aim is to add constraints, our variables need to explicitly model label decisions. Therefore, we introduce binary variables

$$l_{i,j,label} \quad \forall i \in 0..n,\; j \in 1..n,\; label \in best_b(i,j)$$

where n is the number of tokens and the index 0 represents the root token. $best_b(i,j)$ is the set of b labels with maximal $s(i,j,label)$. $l_{i,j,label}$ equals 1 if there is a dependency with label $label$ between tokens i (head) and j (child), and 0 otherwise.
Furthermore, we introduce binary auxiliary variables

$$d_{i,j} \quad \forall i \in 0..n,\; j \in 1..n$$

representing the existence of a dependency between tokens i and j. We connect these to the $l_{i,j,label}$ variables by the constraint

$$d_{i,j} = \sum_{label} l_{i,j,label}.$$
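As an illustration, the variables and their coupling constraint could be set up as follows with a generic ILP library. We use PuLP purely for exposition; the paper does not name a solver, and `best_b` is an assumed callable returning the b best labels for a token pair:

```python
import pulp

def build_variables(n, best_b, prob):
    """Create binary l[i,j,label] and d[i,j] variables and couple them:
    d[i,j] equals the sum of the label variables for the pair (i, j)."""
    l, d = {}, {}
    for i in range(0, n + 1):        # candidate heads, 0 is the root
        for j in range(1, n + 1):    # children; the root gets no head
            if i == j:
                continue
            d[i, j] = pulp.LpVariable(f"d_{i}_{j}", cat="Binary")
            for label in best_b(i, j):
                l[i, j, label] = pulp.LpVariable(f"l_{i}_{j}_{label}", cat="Binary")
            # coupling constraint: d[i,j] = sum over labels of l[i,j,label]
            prob += d[i, j] == pulp.lpSum(l[i, j, lab] for lab in best_b(i, j))
    return l, d
```

Since this sketch never creates a d[i, 0] variable, the requirement that the root takes no head holds by construction.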
2.1.2 Objective Function
Given the above variables, our objective function can be represented as

$$\sum_{i,j} \sum_{label \in best_k(i,j)} s(i,j,label) \cdot l_{i,j,label}$$

with a suitable k.
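Continuing the PuLP sketch above, the objective is then just the weighted sum of the label variables; `s` stands for the learned score function and is again illustrative only:

```python
def set_objective(prob, l, s):
    """Maximise sum over (i, j, label) of s(i,j,label) * l[i,j,label].
    Assumes prob was created as pulp.LpProblem("parse", pulp.LpMaximize)."""
    prob += pulp.lpSum(s(i, j, label) * var
                       for (i, j, label), var in l.items())
```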
2.1.3 Constraints Added in Advance
Only One Head In all our languages every token has exactly one head. This yields

$$\sum_{i \geq 0} d_{i,j} = 1$$

for non-root tokens j > 0 and

$$\sum_{i} d_{i,0} = 0$$

for the artificial root node.
Typed Arity Constraints We might encounter solutions of the basic model that contain, for instance, verbs with two subjects. To forbid these we simply augment our model with constraints such as

$$\sum_{j} l_{i,j,subject} \leq 1$$

for all verbs i in a sentence.
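Both kinds of up-front constraints translate directly into the same PuLP sketch; `verbs` is a hypothetical list of verb token indices, and "subject" stands for whatever the treebank's subject label is:

```python
def add_head_constraints(prob, d, n):
    """Each non-root token takes exactly one head (the root, index 0,
    is a candidate head; it never appears as a child here)."""
    for j in range(1, n + 1):
        prob += pulp.lpSum(d[i, j] for i in range(0, n + 1) if i != j) == 1

def add_arity_constraints(prob, l, verbs, n):
    """A verb may head at most one 'subject' dependency."""
    for i in verbs:
        subjects = [l[i, j, "subject"] for j in range(1, n + 1)
                    if (i, j, "subject") in l]
        if subjects:
            prob += pulp.lpSum(subjects) <= 1
```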
2.1.4 Incremental Constraints
No Cycles If a solution contains one or more cycles C we add the following constraints to our IP: for every c ∈ C we add

$$\sum_{(i,j) \in c} d_{i,j} \leq |c| - 1$$

to forbid c.
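One way to detect the cycles and emit the corresponding constraints between solver calls; `active_edges` are the (head, child) pairs with d[i,j] = 1 in the current solution, and this is a sketch of the idea rather than the authors' code:

```python
def find_cycles(active_edges):
    """Walk head pointers from every token; a walk that re-enters its own
    path has found a cycle."""
    head = {j: i for (i, j) in active_edges}   # exactly one head per token
    cycles, done = [], set()
    for start in head:
        path, node = [], start
        while node in head and node not in done and node not in path:
            path.append(node)
            node = head[node]
        if node in path:                        # cycle = tail of the path
            members = path[path.index(node):]
            cycles.append([(head[j], j) for j in members])
        done.update(path)
    return cycles

def add_cycle_constraints(prob, d, cycles):
    """Forbid each cycle c: at most |c| - 1 of its edges may be active."""
    for c in cycles:
        prob += pulp.lpSum(d[i, j] for (i, j) in c) <= len(c) - 1
```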
Coordination Argument Constraints In coordinations, conjuncts have to be of compatible types. For example, nouns cannot coordinate with verbs. We implemented this constraint by checking the parses for occurrences of incompatible arguments. If we find two arguments j, k for a conjunction i, that is, $d_{i,j}$ and $d_{i,k}$ are active, and j is a noun and k is a verb, then we add

$$d_{i,j} + d_{i,k} \leq 1$$

to forbid configurations in which both dependencies are active.
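A sketch of the corresponding check; `pos` maps token indices to coarse POS tags and `conj_args` lists the arguments currently attached to each conjunction, both names being our own:

```python
def add_coordination_constraints(prob, d, conj_args, pos):
    """For a conjunction i with a noun argument j and a verb argument k,
    forbid both dependencies from being active at once."""
    for i, args in conj_args.items():
        for j in args:
            for k in args:
                if j != k and pos[j] == "noun" and pos[k] == "verb":
                    prob += d[i, j] + d[i, k] <= 1
```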
Projective Parsing In the incremental ILP framework, projective parsing can easily be implemented by checking for crossing dependencies after each iteration and forbidding them in the next. If we see two dependencies that cross, $d_{i,j}$ and $d_{k,l}$, we add the constraint

$$d_{i,j} + d_{k,l} \leq 1$$

to prevent this in the next iteration. This can also be used to prevent specific types of crossings. For instance, in Dutch we could allow crossing dependencies only as long as neither dependency is a “Determiner” relation.
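Two dependencies cross exactly when one, but not both, endpoints of one lies strictly between the endpoints of the other, which gives a simple check to run after each iteration (a sketch under the same conventions as the blocks above):

```python
def crosses(e1, e2):
    """True iff dependencies e1 and e2 cross: exactly one endpoint of e2
    falls strictly inside the span of e1."""
    lo, hi = min(e1), max(e1)
    return (lo < min(e2) < hi) != (lo < max(e2) < hi)

def add_projectivity_constraints(prob, d, active_edges):
    """Forbid every crossing pair found in the current solution."""
    for a in range(len(active_edges)):
        for b in range(a + 1, len(active_edges)):
            e1, e2 = active_edges[a], active_edges[b]
            if crosses(e1, e2):
                prob += d[e1] + d[e2] <= 1
```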
2.2 Training
We used single-best MIRA (Crammer and Singer, 2003). For all experiments we used 10 training iterations and non-projective decoding. Note that we used the original spanning tree algorithm for decoding during training, as it was faster.
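For reference, a sketch of the single-best MIRA update of Crammer and Singer (2003), with sparse feature vectors as dicts; taking `loss` to be the number of tokens with an incorrect head is our assumption, not something the paper states:

```python
def mira_update(w, f_gold, f_pred, loss):
    """Smallest-norm update to w that makes the gold parse outscore the
    current best parse by a margin of at least `loss`."""
    delta = {k: f_gold.get(k, 0.0) - f_pred.get(k, 0.0)
             for k in set(f_gold) | set(f_pred)}
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return w                      # prediction equals gold: nothing to do
    margin = sum(w.get(k, 0.0) * v for k, v in delta.items())
    tau = max(0.0, (loss - margin) / norm_sq)
    for k, v in delta.items():
        w[k] = w.get(k, 0.0) + tau * v
    return w
```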
3 System Summary
We use four different feature sets. The first feature set, BASELINE, is taken from McDonald et al. (2005b). It uses the FORM and the POSTAG fields. This set also includes features that combine the label and POS tag of head and child, such as $(Label, POS_{Head})$ and $(Label, POS_{Child-1})$. For our Arabic and Japanese development sets we obtained the best results with this configuration. We also use this configuration for Chinese, German and Portuguese, because training with other configurations took too much time (more than 7 days).
The BASELINE set also uses a pseudo-coarse-POS tag (the first character of the POSTAG) and a pseudo-lemma tag (the first 4 characters of the FORM when its length is more than 3). For the next configuration we substitute these pseudo-tags with the CPOSTAG and LEMMA fields given in the data. This configuration was used for Czech because for other configurations training could not be finished in time.
The third feature set tries to exploit the generic FEATS field, which can contain a list of features such as case and gender. A set of features per dependency is extracted using this information; it consists of the cross product of the features in FEATS. We used this configuration for Danish, Dutch, Spanish and Turkish, where it showed the best results during development.
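The paper leaves the exact construction open; one plausible reading is to pair every FEATS attribute of the head with every attribute of the child. A hedged sketch, where the joint-feature format is our own:

```python
from itertools import product

def feats_cross_product(head_feats, child_feats):
    """Cross product of the FEATS lists of head and child,
    yielding one joint feature per pair."""
    return [f"{h}|{c}" for h, c in product(head_feats, child_feats)]

# e.g. feats_cross_product(["case=nom"], ["case=acc", "gen=fem"])
# -> ['case=nom|case=acc', 'case=nom|gen=fem']
```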
The fourth feature set uses the triplet of label, head POS and child POS as a feature, such as $(Label, POS_{Head}, POS_{Child})$. It also uses the CPOSTAG and LEMMA fields for the head. This configuration is used for the Slovene and Swedish data, where it performed best during development.
Finally, we add constraints for Chinese, Dutch, Japanese and Slovene: arity constraints for Chinese and Slovene, coordination and arity constraints for Dutch, and arity and selective projectivity constraints for Japanese (the latter to capture the fact that crossing dependencies in Japanese can only be introduced through disfluencies). For all experiments b was set to 2. We did not apply additional constraints to any other languages due to lack of time.
4 Results
Our results on the test set are shown in Table 1.

Model  AR     CH     CZ     DA     DU     GE     JP     PO     SL     SP     SW     TU
OURS   66.65  89.96  67.64  83.63  78.59  86.24  90.51  84.43  71.20  77.38  80.66  58.61
AVG    59.94  78.32  67.17  78.31  70.73  78.58  85.86  80.63  65.16  73.53  76.44  55.95
TOP    66.91  89.96  80.18  84.79  79.19  87.34  91.65  87.60  73.44  82.25  84.58  65.68

Table 1: Labelled accuracy on the test sets.
Our results are well above the average for all lan-
guages but Czech. For Chinese we perform signif-
icantly better than all other participants (p = 0.00)
and we are among the top three entries for Dutch, German and Danish. Although Dutch and Chinese are languages where we included additional constraints, our scores are not a result of these. Table 2 compares the results for the languages with additional constraints.

Constraints  DU    CH    SL    JA
with         3927  4464  3612  4526
without      3928  4471  3563  4528

Table 2: Number of tokens correctly classified with and without constraints.
Adding constraints only marginally helps to improve
the system (in the case of Slovene a bug in our im-
plementation even degraded accuracy). A more de-
tailed explanation of this observation is given in the
following section. A possible explanation for our
high accuracy in Chinese could be the fact that we
were not able to optimise the feature set on the de-
velopment set (see the previous section). Maybe this
prevented us from overfitting. It should be noted that
we did use non-projective parsing for Chinese, al-
though the corpus was fully projective. Our worst
results in comparison with other participants can be
seen for Czech. We attribute this to the reduced
training set we had to use in order to produce a
model in time, even when using the original MST
algorithm.
4.1 Chinese
For Chinese the parser was augmented with a set of
constraints that disallowed more than one argument
of the types head, goal, nominal, range, theme, rea-
son, DUMMY, DUMMY1 and DUMMY2.
By enforcing arity constraints we could either turn
wrong labels/heads into right ones and improve ac-
curacy or turn right labels/heads into wrong ones and
degrade accuracy. For the test set the number of im-
provements (36) was higher than the number of er-
rors (22). However, this margin was outweighed by
a few sentences we could not properly process be-
cause our inference method timed out. Our overall improvement was thus an unimpressive 7 tokens.
In the context of duplicate “head” dependencies
(that is, dependencies labelled “head”) the num-
ber of sentences where accuracy dropped far out-
weighed the number of sentences where improve-
ments could be gained. Removing the arity con-
straints on “head” labels therefore should improve
our results.
This shows the importance of good second best
dependencies. If the dependency with the second
highest score is the actual gold dependency and its
score is close to the highest score, we are likely to
pick this dependency in the presence of additional
constraints. On the other hand, if the dependency
with the second highest score is not the gold one and
its score is too high, we will probably include this
dependency in order to fulfil the constraints.
There may be some further improvement to be
gained if we train our model using k-best MIRA
with k > 1 since it optimises weights with respect
to the k best parses.
4.2 Turkish
There is a considerable gap between the unlabelled and labelled results for Turkish. In terms of labels, the POS type Noun gives the worst performance, because a subject was often classified as an object or vice versa.
Case information in Turkish assigns argument
roles for nouns by marking different semantic roles.
Many errors in the Turkish data might have been
caused by the fact that this information was not ad-
equately used. Instead of fine-tuning our feature set
to Turkish we used the feature cross product as de-
scribed in Section 3. Some of the rather meaning-
less combinations might have neutralised the effect
of sensible ones. We believe that using morpho-
logical case information in a sound way would im-
prove both the unlabelled and the labelled dependen-
cies. However, we have not performed a separate experiment to test whether using the case information alone would improve the system. This could be the focus of future work.
5 Conclusion
In this work we presented a novel way of solving the
linear model of McDonald et al. (2005a) for projec-
tive and non-projective parsing based on an incre-
mental ILP approach. This allowed us to include
additional linguistic constraints such as “a verb can only have one subject”.
Due to time constraints we applied additional
constraints to only four languages. For each one
we gained better results than the baseline without
constraints; however, this improvement was only marginal. This can be attributed to four main rea-
sons: Firstly, the next best solution that fulfils the
constraints was even worse (Chinese). Secondly,
noisy POS tags caused coordination constraints to
fail (Dutch). Thirdly, inference timed out (Chinese)
and fourthly, constraints were not violated that often
in the first place (Japanese).
However, the effect of the first problem might be
reduced by training with a higher k. The second
problem could partly be overcome by using a bet-
ter tagger or by a special treatment within the con-
straint handling for word types which are likely to
be mistagged. The third problem could be avoided
by adding constraints during the branch and bound
algorithm, avoiding the need to resolve the full prob-
lem “from scratch” for every constraint added. With
these remedies significant improvements to the ac-
curacy for some languages might be possible.
6 Acknowledgements
We would like to thank Beata Kouchnir, Abhishek
Arun and James Clarke for their help during the
course of this project.

References
Koby Crammer and Yoram Singer. 2003. Ultraconservative
online algorithms for multiclass problems. J. Mach. Learn.
Res., 3:951–991.
M. Grötschel, L. Lovász, and A. Schrijver. 1981. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1:169–197.
R. McDonald and F. Pereira. 2006. Online learning of approx-
imate dependency parsing algorithms. In Proc. of the 11th
Annual Meeting of the EACL.
R. McDonald, K. Crammer, and F. Pereira. 2005a. Online
large-margin training of dependency parsers. In Proc. of the
43rd Annual Meeting of the ACL.
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of HLT/EMNLP 2005, Vancouver, B.C., Canada.
D. Roth and W. Yih. 2005. Integer linear programming in-
ference for conditional random fields. In Proc. of the In-
ternational Conference on Machine Learning (ICML), pages
737–744.
David Michael Warme. 1998. Spanning Trees in Hypergraphs
with Application to Steiner Trees. Ph.D. thesis, University of
Virginia.
Justin C. Williams. 2002. A linear-size zero-one programming model for the minimum spanning tree problem in planar graphs. Networks, 39:53–60.
