Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X),
pages 181–185, New York City, June 2006. c©2006 Association for Computational Linguistics
Projective Dependency Parsing with Perceptron
Xavier Carreras, Mihai Surdeanu and Llu´ıs M`arquez
TALP Research Centre – Software Department (LSI)
Technical University of Catalonia (UPC)
Campus Nord - Edifici Omega, Jordi Girona Salgado 1–3, E-08034 Barcelona
{carreras,surdeanu,lluism}@lsi.upc.edu
Abstract
We describe an online learning depen-
dency parser for the CoNLL-X Shared
Task, based on the bottom-up projective
algorithm of Eisner (2000). We experi-
ment with a large feature set that mod-
els: the tokens involved in dependencies
and their immediate context, the surface-
text distance between tokens, and the syn-
tactic context dominated by each depen-
dency. In experiments, the treatment of
multilingual information was totally blind.
1 Introduction
We describe a learning system for the CoNLL-X
Shared Task on multilingual dependency parsing
(Buchholz et al., 2006), for 13 different languages.
Our system is a bottom-up projective dependency
parser, based on the cubic-time algorithm by Eisner
(1996; 2000). The parser uses a learning function
that scores all possible labeled dependencies. This
function is trained globally with online Perceptron,
by parsing training sentences and correcting its pa-
rameters based on the parsing mistakes. The features
used to score, while based on the previous work in
dependency parsing (McDonald et al., 2005), intro-
duce some novel concepts such as better codification
of context and surface distances, and runtime infor-
mation from dependencies previously parsed.
Regarding experimentation, the treatment of mul-
tilingual data has been totally blind, with no spe-
cial processing or features that depend on the lan-
guage. Considering its simplicity, our system
achieves moderate but encouraging results, with an
overall labeled attachment accuracy of 74.72% on
the CoNLL-X test set.
2 Parsing and Learning Algorithms
This section describes the three main components of
the dependency parsing: the parsing model, the pars-
ing algorithm, and the learning algorithm.
2.1 Model
Let 1,...,L be the dependency labels, defined be-
forehand. Let x be a sentence of n words, x1 ...xn.
Finally, let Y(x) be the space of well-formed depen-
dency trees for x. A dependency tree y ∈ Y(x) is a
set of n dependencies of the form [h,m,l], where
h is the index of the head word (0 ≤ h ≤ n,
where 0 means root), m is the index of the modi-
fier word (1 ≤ m ≤ n), and l is the dependency
label (1 ≤ l ≤ L). Each word of x participates as a
modifier in exactly one dependency of y.
Our dependency parser, dp, returns the maximum
scored dependency tree for a sentence x:
dp(x,w) = argmax
y∈Y(x)
summationdisplay
[h,m,l]∈y
sco([h,m,l],x,y,w)
In the formula, w is the weight vector of the
parser, that is, the set of parameters used to score de-
pendencies during the parsing process. It is formed
by a concatenation of L weight vectors, one for each
dependency label, w = (w1,...,wl,...,wL). We
assume a feature extraction function, φ, that repre-
sents an unlabeled dependency [h,m] in a vector of
D features. Each of the wl has D parameters or
dimensions, one for each feature. Thus, the global
181
weight vector w maintains L × D parameters. The
scoring function is defined as follows:
sco([h,m,l],x,y,w) = φ(h,m,x,y)· wl
Note that the scoring of a dependency makes use
of y, the tree that contains the dependency. As de-
scribed next, at scoring time y just contains the de-
pendencies found between h and m.
2.2 Parsing Algorithm
We use the cubic-time algorithm for dependency
parsing proposed by Eisner (1996; 2000). This pars-
ing algorithm assumes that trees are projective, that
is, dependencies never cross in a tree. While this as-
sumption clearly does not hold in the CoNLL-X data
(only Chinese trees are actually 100% projective),
we chose this algorithm for simplicity. As it will be
shown, the percentage of non-projective dependen-
cies is not very high, and clearly the error rates we
obtain are caused by other major factors.
The parser is a bottom-up dynamic programming
algorithm that visits sentence spans of increasing
length. In a given span, from word s to word e, it
completes two partial dependency trees that cover
all words within the span: one rooted at s and the
other rooted at e. This is done in two steps. First, the
optimal dependency structure internal to the span is
chosen, by combining partial solutions from inter-
nal spans. This structure is completed with a depen-
dency covering the whole span, in two ways: from
s to e, and from e to s. In each case, the scoring
function is used to select the dependency label that
maximizes the score.
We take advantage of this two-step processing to
introduce features for the scoring function that rep-
resent some of the internal dependencies of the span
(see Section 3 for details). It has to be noted that
the parsing algorithm we use does not score depen-
dencies on top of every possible internal structure.
Thus, by conditioning on features extracted from y
we are making the search approximative.
2.3 Perceptron Learning
As learning algorithm, we use Perceptron tailored
for structured scenarios, proposed by Collins (2002).
In recent years, Perceptron has been used in a num-
ber of Natural Language Learning works, such as in
w = 0
for t = 1 to T
foreach training example (x,y) do
ˆy = dp(x,w)
foreach [h,m,l] ∈ y\ˆy do
wl = wl +φ(h,m,x, ˆy)
foreach [h,m,l] ∈ ˆy\y do
wl = wl −φ(h,m,x, ˆy)
return w
Figure 1: Pseudocode of the Perceptron Algorithm. T is a
parameter that indicates the number of epochs that the algorithm
cycles the training set.
partial parsing (Carreras et al., 2005) or even depen-
dency parsing (McDonald et al., 2005).
Perceptron is an online learning algorithm that
learns by correcting mistakes made by the parser
when visiting training sentences. The algorithm is
extremely simple, and its cost in time and memory
is independent from the size of the training corpora.
In terms of efficiency, though, the parsing algorithm
must be run at every training sentence.
Our system uses the regular Perceptron working
in primal form. Figure 1 sketches the code. Given
the number of languages and dependency types in
the CoNLL-X exercise, we found prohibitive to
work with a dual version of Perceptron, that would
allow the use of a kernel function to expand features.
3 Features
The feature extraction function, φ(h,m,x,y), rep-
resents in a feature vector a dependency from word
positions m to h, in the context of a sentence x and a
dependency tree y. As usual in discriminative learn-
ing, we work with binary indicator features: if a cer-
tain feature is observed in an instance, the value of
that feature is 1; otherwise, the value is 0. For con-
venience, we describe φ as a composition of several
base feature extraction functions. Each extracts a
number of disjoint features. The feature extraction
function φ(h,m,x,y) is calculated as:
φtoken(x,h,“head”) + φtctx(x,h,“head”) +
φtoken(x,m,“mod”) + φtctx(x,m,“mod”) +
φdep(x,mmdh,m) + φdctx(x,mmdh,m) +
φdist(x,mmdh,m) + φruntime(x,y,h,m,dh,m)
where φtoken extracts context-independent token
features, φtctx computes context-based token fea-
tures, φdep computes context-independent depen-
182
φtoken(x,i,type)
type·w(xi)
type·l(xi)
type·cp(xi)
type·fp(xi)
foreach(ms): type·ms(xi)
type·w(xi)·cp(xi)
foreach(ms): type·w(xi)·ms(xi)
φtctx(x,i,type)
φtoken(x,i−1,type·string(i−1))
φtoken(x,i−2,type·string(i−2))
φtoken(x,i+1,type·string(i+1))
φtoken(x,i+2,type·string(i+2))
type·cp(xi)·cp(xi−1)
type·cp(xi)·cp(xi−1)·cp(xi−2)
type·cp(xi)·cp(xi+1)
type·cp(xi)·cp(xi+1)·cp(xi+2)
Table 1: Token features, both context-independent (φtoken)
and context-based (φtctx). type - token type, i.e. “head” or
“mod”, w - token word, l - token lemma, cp - token coarse part-
of-speech (POS) tag, fp - token fine-grained POS tag, ms -
token morpho-syntactic feature. The · operator stands for string
concatenation.
φdep(x,i,j,dir)
dir·w(xi)·cp(xi)·w(xj)·cp(xj)
dir·cp(xi)·w(xj)·cp(xj)
dir·w(xi)·w(xj)·cp(xj)
dir·w(xi)·cp(xi)·cp(xj)
dir·w(xi)·cp(xi)·w(xj)
dir·w(xi)·w(xj)
dir·cp(xi)·cp(xj)
φdctx(x,i,j,dir)
dir·cp(xi)·cp(xi+1)·cp(xj−1)·cp(xj)
dir·cp(xi−1)·cp(xi)·cp(xj−1)·cp(xj)
dir·cp(xi)·cp(xi+1)·cp(xj)·cp(xj+1)
dir·cp(xi−1)·cp(xi)·cp(xj)·cp(xj+1)
Table 2: Dependency features, both context-independent
(φdep) and context-based (φdctx), between two points i and j,
i < j. dir - dependency direction: left to right or right to left.
dency features, φdctx extracts contextual depen-
dency features, φdist calculates surface-distance fea-
tures between the two tokens, and finally, φruntime
computes dynamic features at runtime based on the
dependencies previously built for the given interval
during the bottom-up parsing. mmdh,m is a short-
hand for a triple of numbers: min(h,m), max(h,m)
and dh,m (a sign indicating the direction, i.e., +1 if
m < h, and −1 otherwise).
We detail the token features in Table 1, the depen-
dency features in Table 2, and the surface-distance
features in Table 3. Most of these features are in-
spired by previous work in dependency parsing (Mc-
Donald et al., 2005; Collins, 1999). What is impor-
φdist(x,i,j,dir)
foreach(k ∈ (i,j)): dir·cp(xi)·cp(xk)·cp(xj)
number of tokens between i and j
number of verbs between i and j
number of coordinations between i and j
number of punctuations signs between i and j
Table 3: Surface distance features between points i and j. Nu-
meric features are discretized using “binning” to a small number
of intervals.
φruntime(x,y,h,m,dir)
let l1,...,lS be the labels of dependencies
in y that attach to h and are found from m to h.
foreach i, 1≤i≤S : dir·cp(xh)·cp(xm)·li
if S≥1 , dir·cp(xh)·cp(xm)·l1
if S≥2 , dir·cp(xh)·cp(xm)·l1 ·l2
if S≥3 , dir·cp(xh)·cp(xm)·l1 ·l2 ·l3
if S≥4 , dir·cp(xh)·cp(xm)·l1 ·l2 ·l3 ·l4
if S=0 , dir·cp(xh)·cp(xm)·null
if 0<S≤4 , dir·cp(xh)·cp(xm)·regular
if S>4 , dir·cp(xh)·cp(xm)·big
Table 4: Runtime features of y between m and h.
tant for the work presented here is that we construct
explicit feature combinations (see above tables) be-
cause we configured our linear predictors in primal
form, in order to keep training times reasonable.
While the features presented in Tables 1, 2, and 3
are straightforward exploitations of the training data,
the runtime features (φruntime) take a different, and
to our knowledge novel in the proposed framework,
approach: for a dependency from m to h, they rep-
resent the dependencies found between m and h
that attach also to h. They are described in detail
in Table 4. As we have noted above, these fea-
tures are possible because of the parsing scheme,
which scores a dependency only after all dependen-
cies spanned by it are scored.
4 Experiments and Results
We experimented on the 13 languages proposed
in the CoNLL-X Shared Task (Hajiˇc et al., 2004;
Simov et al., 2005; Simov and Osenova, 2003; Chen
et al., 2003; B¨ohmov´a et al., 2003; Kromann, 2003;
van der Beek et al., 2002; Brants et al., 2002;
Kawata and Bartels, 2000; Afonso et al., 2002;
Dˇzeroski et al., 2006; Civit and Mart´ı, 2002; Nilsson
et al., 2005; Oflazer et al., 2003; Atalay et al., 2003).
Our approach to deal with many different languages
was totally blind: we did not inspect the data to mo-
tivate language-specific features or processes.
183
We did feature filtering based on frequency
counts. Our feature extraction patterns, that ex-
ploit both lexicalization and combination, gener-
ate millions of feature dimensions, even with small
datasets. Our criterion was to use at most 500,000
different dimensions in each label weight vector. For
each language, we generated all possible features,
and then filtered out most of them according to the
counts. Depending on the number of training sen-
tences, our counts cut-offs vary from 3 to 15.
For each language, we held out from training data
a portion of sentences (300, 500 or 1000 depend-
ing on the total number of sentences) and trained a
model for up to 20 epochs in the rest of the data. We
evaluated each model on the held out data for differ-
ent number of training epochs, and selected the op-
timum point. Then, we retrained each model on the
whole training set for the selected number of epochs.
Table 5 shows the attachment scores obtained
by our system, both unlabeled (UAS) and labeled
(LAS). The first column (GOLD) presents the LAS
obtained with a perfect scoring function: the loss in
accuracy is related to the projectivity assumption of
our parsing algorithm. Dutch turns out to be the
most non-projective language, with a loss in accu-
racy of 5.44%. In our opinion, the loss in other lan-
guages is relatively small, and is not a major limita-
tion to achieve a high performance in the task. Our
system achieves an overall LAS of 74.72%, with
substantial variation from one language to another.
Turkish, Arabic, Dutch, Slovene and Czech turn out
to be the most difficult languages for our system,
with accuracies below 70%. The easiest language
is clearly Japanese, with a LAS of 88.13%, followed
by Chinese, Portuguese, Bulgarian and German, all
with LAS above 80%.
Table 6 shows the contribution of base feature ex-
traction functions. For four languages, we trained
models that increasingly incorporate base functions.
It can be shown that all functions contribute to a bet-
ter score. Contextual features (φ3) bring the system
to the final order of performance, while distance (φ4)
and runtime (φ) features still yield substantial im-
provements.
5 Analysis and Conclusions
It is difficult to explain the difference in performance
across languages. Nevertheless, we have identified
GOLD UAS LAS
Bulgarian 99.56 88.81 83.30
Arabic 99.76 72.65 60.94
Chinese 100.0 88.65 83.68
Czech 97.78 77.44 68.82
Danish 99.18 85.67 79.74
Dutch 94.56 71.39 67.25
German 98.84 85.90 82.41
Japanese 99.16 90.79 88.13
Portuguese 98.54 87.76 83.37
Slovene 98.38 77.72 68.43
Spanish 99.96 80.77 77.16
Swedish 99.64 85.54 78.65
Turkish 98.41 70.05 58.06
Overall 98.68 81.19 74.72
Table 5: Results of the system on test data. GOLD: labeled
attachment score using gold scoring functions; the loss in ac-
curacy is caused by the projectivity assumption made by the
parser. UAS : unlabeled attachment score. LAS : labeled at-
tachment score, the measure to compare systems in CoNLL-X.
Bulgarian is excluded from overall scores.
φ1 φ2 φ3 φ4 φ
Turkish 33.02 48.00 55.33 57.16 58.06
Spanish 12.80 53.80 68.18 74.27 77.16
Portuguese 47.10 64.74 80.89 82.89 83.37
Japanese 38.78 78.13 86.87 88.27 88.13
Table 6: Labeled attachment scores at increasing feature con-
figurations. φ1 uses only φtoken at the head and modifier. φ2
extends φ1 with φdep. φ3 incorporates context features, namely
φtctx at the head and modifier, and φdctx. φ4 extends φ3 with
φdist. Finally, the final feature extraction function φ increases
φ4 with φruntime.
four generic factors that we believe caused the most
errors across all languages:
Size of training sets: the relation between the
amount of training data and performance is strongly
supported in learning theory. We saw the same re-
lation in this evaluation: for Turkish, Arabic, and
Slovene, languages with limited number of train-
ing sentences, our system obtains accuracies below
70%. However, one can not argue that the training
size is the only cause of errors: Czech has the largest
training set, and our accuracy is also below 70%.
Modeling large distance dependencies: even
though we include features to model the distance
between two dependency words (φdist), our analy-
sis indicates that these features fail to capture all the
intricacies that exist in large-distance dependencies.
Table 7 shows that, for the two languages analyzed,
the system performance decreases sharply as the dis-
tance between dependency tokens increases.
184
to root 1 2 3 − 6 >= 7
Spanish 83.04 93.44 86.46 69.97 61.48
Portuguese 90.81 96.49 90.79 74.76 69.01
Table 7: Fβ=1 score related to dependency token distance.
Modeling context: many attachment decisions, e.g.
prepositional attachment, depend on additional con-
text outside of the two dependency tokens. To ad-
dress this issue, we have included in our model fea-
tures to capture context, both static (φdctx and φtctx)
and dynamic (φruntime). Nevertheless, our error
analysis indicates that our model is not rich enough
to capture the context required to address complex
dependencies. All the top 5 focus words with the
majority of errors for Spanish and Portuguese – “y”,
“de”, “a”, “en”, and “que” for Spanish, and “em”,
“de”, “a”, “e”, and “para” for Portuguese – indicate
complex dependencies such as prepositional attach-
ments or coordinations.
Projectivity assumption: Dutch is the language
with most crossing dependencies in this evaluation,
and the accuracy we obtain is below 70%.
On the Degree of Lexicalization We conclude the
error analysis of our model with a look at the de-
gree of lexicalization in our model. A quick analy-
sis of our model on the test data indicates that only
34.80% of the dependencies for Spanish and 42.94%
of the dependencies for Portuguese are fully lexical-
ized, i.e. both the head and modifier words appear
in the model feature set (see Table 8). There are
two reasons that cause our model to be largely un-
lexicalized: (a) in order to keep training times rea-
sonable we performed heavy filtering of all features
based on their frequency, which eliminates many
lexicalized features from the final model, and (b)
due to the small size of most of the training cor-
pora, most lexicalized features simply do not ap-
pear in the testing section. Considering these re-
sults, a reasonable question to ask is: how much
are we losing because of this lack of lexical infor-
mation? We give an approximate answer by ana-
lyzing the percentage of fully-lexicalized dependen-
cies that are correctly parsed by our model. As-
suming that our model scales well, the accuracy on
fully-lexicalized dependencies is an indication for
the gain (or loss) to be had from lexicalization. Our
model parses fully-lexicalized dependencies with an
Fully One token Fully
lexicalized unlexicalized unlexicalized
Spanish 34.80% 54.77% 10.43%
Portuguese 42.94% 49.26% 7.80%
Table 8: Degree of dependency lexicalization.
accuracy of 74.81% LAS for Spanish (2.35% lower
than the overall score) and of 83.77% LAS for Por-
tuguese (0.40% higher than the overall score). This
analysis indicates that our model has limited gains
(if any) from lexicalization.
In order to improve the quality of our dependency
parser we will focus on previously reported issues
that can be addressed by a parsing model: large-
distance dependencies, better modeling of context,
and non-projective parsing algorithms.
Acknowledgements
This work was partially funded by the European Union Com-
mission (PASCAL - IST-2002-506778) and Spanish Ministry
of Science and Technology (TRANGRAM - TIN2004-07925-
C03-02). Mihai Surdeanu was supported by a Ramon y Cajal
fellowship of the latter institution.

References
S. Buchholz, E. Marsi, A. Dubey, and Y. Krymolowski. 2006.
CoNLL-X shared task on multilingual dependency parsing.
In Proc. of the Tenth Conf. on Computational Natural Lan-
guage Learning (CoNLL-X). SIGNLL.
X. Carreras, Llu´ıs M`arquez, and J. Castro. 2005. Filtering-
ranking perceptron learning for partial parsing. Machine
Learning, 1–3(60):41–71.
M. Collins. 1999. Head-Driven Statistical Models for Natural
Language Parsing. Ph.D. thesis, University of Pennsylvania.
M. Collins. 2002. Discriminative training methods for hidden
markov models: Theory and experiments with perceptron al-
gorithms. In Proc. of EMNLP-2002.
J. Eisner. 1996. Three new probabilistic models for depen-
dency parsing: An exploration. In Proc. of the 16th Intern.
Conf. on Computational Linguistics (COLING).
J. Eisner. 2000. Bilexical grammars and their cubic-time pars-
ing algorithms. In H. C. Bunt and A. Nijholt, editors, New
Developments in Natural Language Parsing, pages 29–62.
Kluwer Academic Publishers.
R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-
margin training of dependency parsers. In Proc. of the 43rd
Annual Meeting of the ACL.
