Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 257–264,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Graph Transformations in Data-Driven Dependency Parsing
Jens Nilsson
V¨axj¨o University
jni@msi.vxu.se
Joakim Nivre
V¨axj¨o University and
Uppsala University
nivre@msi.vxu.se
Johan Hall
V¨axj¨o University
jha@msi.vxu.se
Abstract
Transforming syntactic representations in
order to improve parsing accuracy has
been exploited successfully in statistical
parsing systems using constituency-based
representations. In this paper, we show
that similar transformations can give sub-
stantial improvements also in data-driven
dependency parsing. Experiments on the
Prague Dependency Treebank show that
systematic transformations of coordinate
structures and verb groups result in a
10% error reduction for a deterministic
data-driven dependency parser. Combin-
ing these transformations with previously
proposed techniques for recovering non-
projective dependencies leads to state-of-
the-art accuracy for the given data set.
1 Introduction
It has become increasingly clear that the choice
of suitable internal representations can be a very
important factor in data-driven approaches to syn-
tactic parsing, and that accuracy can often be im-
proved by internal transformations of a given kind
of representation. This is well illustrated by the
Collins parser (Collins, 1997; Collins, 1999), scru-
tinized by Bikel (2004), where several transforma-
tions are applied in order to improve the analy-
sis of noun phrases, coordination and punctuation.
Other examples can be found in the work of John-
son (1998) and Klein and Manning (2003), which
show that well-chosen transformations of syntac-
tic representations can greatly improve the parsing
accuracy obtained with probabilistic context-free
grammars.
In this paper, we apply essentially the same
techniques to data-driven dependency parsing,
specifically targeting the analysis of coordination
and verb groups, two very common constructions
that pose special problems for dependency-based
approaches. The basic idea is that we can facili-
tate learning by transforming the training data for
the parser and that we can subsequently recover
the original representations by applying an inverse
transformation to the parser’s output.
The data used in the experiments come from
the Prague Dependency Treebank (PDT) (Hajiˇc,
1998; Hajiˇc et al., 2001), the largest avail-
able dependency treebank, annotated according to
the theory of Functional Generative Description
(FGD) (Sgall et al., 1986). The parser used is
MaltParser (Nivre and Hall, 2005; Nivre et al.,
2006), a freely available system that combines a
deterministic parsing strategy with discriminative
classifiers for predicting the next parser action.
The paper is structured as follows. Section 2
provides the necessary background, including a
definition of dependency graphs, a discussion of
different approaches to the analysis of coordina-
tion and verb groups in dependency grammar, as
well as brief descriptions of PDT, MaltParser and
some related work. Section 3 introduces a set
of dependency graph transformations, specifically
defined to deal with the dependency annotation
found in PDT, which are experimentally evaluated
in section 4. While the experiments reported in
section 4.1 deal with pure treebank transforma-
tions, in order to establish an upper bound on what
can be achieved in parsing, the experiments pre-
sented in section 4.2 examine the effects of differ-
ent transformations on parsing accuracy. Finally,
in section 4.3, we combine these transformations
with previously proposed techniques in order to
optimize overall parsing accuracy. We conclude
in section 5.
257
2 Background
2.1 Dependency Graphs
The basic idea in dependency parsing is that the
syntactic analysis consists in establishing typed,
binary relations, called dependencies, between the
words of a sentence. This kind of analysis can be
represented by a labeled directed graph, defined as
follows:
• Let R = {r1,...,rm} be a set of dependency
types (arc labels).
• A dependency graph for a string of words
W = w1...wn is a labeled directed graph
G = (W,A), where:
– W is the set of nodes, i.e. word tokens
in the input string, ordered by a linear
precedence relation <.
– A is a set of labeled arcs (wi,r,wj), wi,
wj ∈ W, r ∈ R.
• A dependency graph G = (W,A) is well-
formed iff it is acyclic and no node has an
in-degree greater than 1.
We will use the notation wi r→ wj to symbolize
that (wi,r,wj) ∈ A, where wi is referred to as
the head and wj as the dependent. We say that
an arc is projective iff, for every word wj occur-
ring between wi and wk (i.e., wi < wj < wk
or wi > wj > wk), there is a path from wi to
wj. A graph is projective iff all its arcs are pro-
jective. Figure 1 shows a well-formed (projective)
dependency graph for a sentence from the Prague
Dependency Treebank.
2.2 Coordination and Verb Groups
Dependency grammar assumes that syntactic
structure consists of lexical nodes linked by binary
dependencies. Dependency theories are thus best
suited for binary syntactic constructions, where
one element can clearly be distinguished as the
syntactic head. The analysis of coordination is
problematic in this respect, since it normally in-
volves at least one conjunction and two conjuncts.
The verb group, potentially consisting of a whole
chain of verb forms, is another type of construc-
tion where the syntactic relation between elements
is not clear-cut in dependency terms.
Several solutions have been proposed to the
problem of coordination. One alternative is
to avoid creating dependency relations between
the conjuncts, and instead let the conjuncts
have a direct dependency relation to the same
head (Tesni`ere, 1959; Hudson, 1990). Another
approach is to make the conjunction the head and
let the conjuncts depend on the conjunction. This
analysis, which appears well motivated on seman-
tic grounds, is adopted in the FGD framework and
will therefore be called Prague style (PS). It is
exemplified in figure 1, where the conjunction a
(and) is the head of the conjuncts bojovnost´ı and
tvrdost´ı. A different solution is to adopt a more
hierarchical analysis, where the conjunction de-
pends on the first conjunct, while the second con-
junct depends on the conjunction. In cases of
multiple coordination, this can be generalized to a
chain, where each element except the first depends
on the preceding one. This more syntactically
oriented approach has been advocated notably by
Mel’ˇcuk (1988) and will be called Mel’ˇcuk style
(MS). It is illustrated in figure 2, which shows a
transformed version of the dependency graph in
figure 1, where the elements of the coordination
form a chain with the first conjunct (bojovnost´ı) as
the topmost head. Lombardo and Lesmo (1998)
conjecture that MS is more suitable than PS for
incremental dependency parsing.
The difference between the more semantically
oriented PS and the more syntactically oriented
MS is seen also in the analysis of verb groups,
where the former treats the main verb as the head,
since it is the bearer of valency, while the latter
treats the auxiliary verb as the head, since it is the
finite element of the clause. Without questioning
the theoretical validity of either approach, we can
again ask which analysis is best suited to achieve
high accuracy in parsing.
2.3 PDT
PDT (Hajiˇc, 1998; Hajiˇc et al., 2001) consists of
1.5M words of newspaper text, annotated in three
layers: morphological, analytical and tectogram-
matical. In this paper, we are only concerned
with the analytical layer, which contains a surface-
syntactic dependency analysis, involving a set of
28 dependency types, and not restricted to projec-
tive dependency graphs.1 The annotation follows
FGD, which means that it involves a PS analysis of
both coordination and verb groups. Whether better
parsing accuracy can be obtained by transforming
1About 2% of all dependencies are non-projective and
about 25% of all sentences have a non-projective dependency
graph (Nivre and Nilsson, 2005).
258
(“The final of the tournament was distinguished by great fighting spirit and unexpected hardness”)
A7
Velkou
great
a63
Atr
N7
bojovnost´ı
fighting-spirit
a63
Obj Co
Jˆ
a
and
a63
Coord
A7
neˇcekanou
unexpected
a63
Atr
N7
tvrdost´ı
hardness
a63
Obj Co
P4
se
itself
a63
AuxT
Vp
vyznaˇcovalo
distinguished
N2
fin´ale
final
a63
Sb
N2
turnaje
of-the-tournament
a63
Atr
Figure 1: Dependency graph for a Czech sentence from the Prague Dependency Treebank
(“The final of the tournament was distinguished by great fighting spirit and unexpected hardness”)
A7
Velkou
great
a63
Atr
N7
bojovnost´ı
fighting-spirit
a63
Obj
Jˆ
a
and
a63
Coord
A7
neˇcekanou
unexpected
a63
Atr
N7
tvrdost´ı
hardness
a63
Obj
P4
se
itself
a63
AuxT
Vp
vyznaˇcovalo
distinguished
N2
fin´ale
final
a63
Sb
N2
turnaje
of-the-tournament
a63
Atr
Figure 2: Transformed dependency graph for a Czech sentence from the Prague Dependency Treebank
this to MS is one of the hypotheses explored in the
experimental study below.
2.4 MaltParser
MaltParser (Nivre and Hall, 2005; Nivre et al.,
2006) is a data-driven parser-generator, which can
induce a dependency parser from a treebank, and
which supports several parsing algorithms and
learning algorithms. In the experiments below we
use the algorithm of Nivre (2003), which con-
structs a labeled dependency graph in one left-
to-right pass over the input. Classifiers that pre-
dict the next parser action are constructed through
memory-based learning (MBL), using the TIMBL
software package (Daelemans and Van den Bosch,
2005), and support vector machines (SVM), using
LIBSVM (Chang and Lin, 2005).
2.5 Related Work
Other ways of improving parsing accuracy with
respect to coordination include learning patterns
of morphological and semantical information for
the conjuncts (Park and Cho, 2000). More specifi-
cally for PDT, Collins et al. (1999) relabel coordi-
nated phrases after converting dependency struc-
tures to phrase structures, and Zeman (2004) uses
a kind of pattern matching, based on frequencies
of the parts-of-speech of conjuncts and conjunc-
tions. Zeman also mentions experiments to trans-
form the dependency structure for coordination
but does not present any results.
Graph transformations in dependency parsing
have also been used in order to recover non-
projective dependencies together with parsers that
are restricted to projective dependency graphs.
Thus, Nivre and Nilsson (2005) improve parsing
accuracy for MaltParser by projectivizing training
data and applying an inverse transformation to the
output of the parser, while Hall and Nov´ak (2005)
apply post-processing to the output of Charniak’s
parser (Charniak, 2000). In the final experi-
ments below, we combine these techniques with
the transformations investigated in this paper.
3 Dependency Graph Transformations
In this section, we describe algorithms for trans-
forming dependency graphs in PDT from PS to
MS and back, starting with coordination and con-
tinuing with verb groups.
3.1 Coordination
The PS-to-MS transformation for coordination
will be designated τc(∆), where ∆ is a data set.
The transformation begins with the identification
of a base conjunction, based on its dependency
type (Coord) and/or its part-of-speech (Jˆ). For
example, the word a (and) in figure 1 is identified
as a base conjunction.
259
Before the actual transformation, the base con-
junction and all its dependents need to be classi-
fied into three different categories. First, the base
conjunction is categorized as a separator (S). If
the coordination consists of more than two con-
juncts, it normally has one or more commas sep-
arating conjuncts, in addition to the base conjunc-
tion. These are identified by looking at their de-
pendency type (mostly AuxX) and are also catego-
rized as S. The coordination in figure 1 contains
no commas, so only the word a will belong to S.
The remaining dependents of the base conjunc-
tion need to be divided into conjuncts (C) and
other dependents (D). To make this distinction,
the algorithm again looks at the dependency type.
In principle, the dependency type of a conjunct
has the suffix Co, although special care has to be
taken for coordinated prepositional cases and em-
bedded clauses (B¨ohmov´a et al., 2003). The words
bojovnost´ı and tvrdost´ı in figure 1, both having the
dependency type Obj Co, belong to the category
C. Since there are no other dependents of a, the
coordination contains no instances of the category
D.
Given this classification of the words involved
in a coordination, the transformation τc(∆) is
straightforward and basically connects all the arcs
in a chain. Let C1,...,Cn be the elements of C,
ordered by linear precedence, and let S1i,...,Smi
be the separators occurring between Ci and Ci+1.
Then every Ci becomes the head of S1i,...,Smi,
Smi becomes the head of Ci+1, and C1 becomes
the only dependent of the original head of the base
conjunction. The dependency types of the con-
juncts are truncated by removing the suffix Co.2
Also, each word in wd ∈ D becomes a dependent
of the conjunct closest to its left, and if such a word
does not exist, wd will depend on the leftmost con-
junct. After the transformation τc(∆), every coor-
dination forms a left-headed chain, as illustrated
in figure 2.
This new representation creates a problem,
however. It is no longer possible to distinguish the
dependents in D from other dependents of the con-
juncts. For example, the word Velkou in figure 2
is not distinguishable from a possible dependent
in D, which is an obvious drawback when trans-
forming back to PS. One way of distinguishing D
elements is to extend the set of dependency types.
2Preliminary results indicated that this increases parsing
accuracy.
The dependency type r of each wd ∈ D can be re-
placed by a completely new dependency type r+
(e.g., Atr+), theoretically increasing the number
of dependency types to 2· |R|.
The inverse transformation, τ−1c (∆), again
starts by identifying base conjunctions, using the
same conditions as before. For each identified
base conjunction, it calls a procedure that per-
forms the inverse transformation by traversing
the chain of conjuncts and separators “upwards”
(right-to-left), collecting conjuncts (C), separators
(S) and potential conjunction dependents (Dpot).
When this is done, the former head of the left-
most conjunct (C1) becomes the head of the right-
most (base) conjunction (Smn−1). In figure 2,
the leftmost conjunct is bojovnost´ı, with the head
vyznaˇcovalo, and the rightmost (and only) con-
junction is a, which will then have vyznaˇcovalo as
its new head. All conjuncts in the chain become
dependents of the rightmost conjunction, which
means that the structure is converted back to the
one depicted in figure 1.
As mentioned above, the original structure in
figure 1 did not have any coordination dependents,
but Velkou ∈ Dpot. The last step of the inverse
transformation is therefore to sort out conjunction
dependents from conjunct dependents, where the
former will attach to the base conjunction. Four
versions have been implemented, two of which
take into account the fact that the dependency
types AuxG, AuxX, AuxY, and Pred are the only
dependency types that are more frequent as con-
junction dependents (D) than as conjunct depen-
dents in the training data set:
• τc: Do not extend arc labels in τc. Leave all
words in Dpot in place in τ−1c .
• τc∗: Do not extend arc labels in τc. Attach all
words with label AuxG, AuxX, AuxY or Pred
to the base conjunction in τ−1c .
• τc+: Extend arc labels from r to r+ for D
elements in τc. Attach all words with label
r+ to the base conjunction (and change the
label to r) in τ−1c .
• τc+∗: Extend arc labels from r to r+ for D
elements in τc, except for the labels AuxG,
AuxX, AuxY and Pred. Attach all words with
label r+, AuxG, AuxX, AuxY, or Pred to the
base conjunction (and change the label to r if
necessary) in τ−1c .
260
3.2 Verb Groups
To transform verb groups from PS to MS, the
transformation algorithm, τv(∆), starts by identi-
fying all auxiliary verbs in a sentence. These will
belong to the set A and are processed from left to
right. A word waux ∈ A iff wmain AuxV−→ waux,
where wmain is the main verb. The transformation
into MS reverses the relation between the verbs,
i.e., waux AuxV−→ wmain, and the former head of
wmain becomes the new head of waux. The main
verb can be located on either side of the auxiliary
verb and can have other dependents (whereas aux-
iliary verbs never have dependents), which means
that dependency relations to other dependents of
wmain may become non-projective through the
transformation. To avoid this, all dependents to
the left of the rightmost verb will depend on the
leftmost verb, whereas the others will depend on
the rightmost verb.
Performing the inverse transformation for verb
groups, τ−1v (∆), is quite simple and essentially
the same procedure inverted. Each sentence is tra-
versed from right to left looking for arcs of the
type waux AuxV−→ wmain. For every such arc, the
head of waux will be the new head of wmain, and
wmain the new head of waux. Furthermore, since
waux does not have dependents in PS, all depen-
dents of waux in MS will become dependents of
wmain in PS.
4 Experiments
All experiments are based on PDT 1.0, which is
divided into three data sets, a training set (∆t), a
development test set (∆d), and an evaluation test
set (∆e). Table 1 shows the size of each data set, as
well as the relative frequency of the specific con-
structions that are in focus here. Only 1.3% of all
words in the training data are identified as auxil-
iary verbs (A), whereas coordination (S and C)
is more common in PDT. This implies that coor-
dination transformations are more likely to have
a greater impact on overall accuracy compared to
the verb group transformations.
In the parsing experiments reported in sections
4.1–4.2, we use ∆t for training, ∆d for tuning, and
∆e for the final evaluation. The part-of-speech
tagging used (both in training and testing) is the
HMM tagging distributed with the treebank, with
a tagging accuracy of 94.1%, and with the tagset
compressed to 61 tags as in Collins et al. (1999).
Data #S #W %S %C %A
∆t 73088 1256k 3.9 7.7 1.3
∆d 7319 126k 4.0 7.8 1.4
∆e 7507 126k 3.8 7.3 1.4
Table 1: PDT data sets; S = sentence, W = word;
S = separator, C = conjunct, A = auxiliary verb
T AS
τc 97.8
τc∗ 98.6
τc+ 99.6
τc+∗ 99.4
τv 100.0
Table 2: Transformations; T = transformation;
AS = attachment score (unlabeled) of τ−1(τ(∆t))
compared to ∆t
MaltParser is used with the parsing algorithm of
Nivre (2003) together with the feature model used
for parsing Czech by Nivre and Nilsson (2005).
In section 4.2 we use MBL, again with the same
settings as Nivre and Nilsson (2005),3 and in sec-
tion 4.2 we use SVM with a polynomial kernel of
degree 2.4 The metrics for evaluation are the at-
tachment score (AS) (labeled and unlabeled), i.e.,
the proportion of words that are assigned the cor-
rect head, and the exact match (EM) score (labeled
and unlabeled), i.e., the proportion of sentences
that are assigned a completely correct analysis.
All tokens, including punctuation, are included in
the evaluation scores. Statistical significance is as-
sessed using McNemar’s test.
4.1 Experiment 1: Transformations
The algorithms are fairly simple. In addition, there
will always be a small proportion of syntactic con-
structions that do not follow the expected pattern.
Hence, the transformation and inverse transforma-
tion will inevitably result in some distortion. In
order to estimate the expected reduction in pars-
ing accuracy due to this distortion, we first con-
sider a pure treebank transformation experiment,
where we compare τ−1(τ(∆t)) to ∆t, for all the
different transformations τ defined in the previous
section. The results are shown in table 2.
We see that, even though coordination is more
frequent, verb groups are easier to handle.5 The
3TIMBL parameters: -k5 -mM -L3 -w0 -dID.
4LIBSVM parameters: -s0 -t1 -d2 -g0.12 -r0 -c1 -e0.1.
5The result is rounded to 100.0% but the transformed tree-
261
coordination version with the least loss of infor-
mation (τc+) fails to recover the correct head for
0.4% of all words in ∆t.
The difference between τc+ and τc is expected.
However, in the next section this will be contrasted
with the increased burden on the parser for τc+,
since it is also responsible for selecting the correct
dependency type for each arc among as many as
2· |R| types instead of |R|.
4.2 Experiment 2: Parsing
Parsing experiments are carried out in four steps
(for a given transformation τ):
1. Transform the training data set into τ(∆t).
2. Train a parser p on τ(∆t).
3. Parse a test set ∆ using p with output p(∆).
4. Transform the parser output into τ−1(p(∆)).
Table 3 presents the results for a selection of trans-
formations using MaltParser with MBL, tested on
the evaluation test set ∆e with the untransformed
data as baseline. Rows 2–5 show that transform-
ing coordinate structures to MS improves parsing
accuracy compared to the baseline, regardless of
which transformation and inverse transformation
are used. Moreover, the parser benefits from the
verb group transformation, as seen in row 6.
The final row shows the best combination of a
coordination transformation with the verb group
transformation, which amounts to an improvement
of roughly two percentage points, or a ten percent
overall error reduction, for unlabeled accuracy.
All improvements over the baseline are statis-
tically significant (McNemar’s test) with respect
to attachment score (labeled and unlabeled) and
unlabeled exact match, with p < 0.01 except for
the unlabeled exact match score of the verb group
transformation, where 0.01 < p < 0.05. For the
labeled exact match, no differences are significant.
The experimental results indicate that MS is
more suitable than PS as the target representation
for deterministic data-driven dependency parsing.
A relevant question is of course why this is the
case. A partial explanation may be found in the
“short-dependency preference” exhibited by most
parsers (Eisner and Smith, 2005), with MaltParser
being no exception. The first row of table 4 shows
the accuracy of the parser for different arc lengths
under the baseline condition (i.e., with no trans-
formations). We see that it performs very well on
bank contains 19 erroneous heads.
AS EM
T U L U L
None 79.08 72.83 28.99 21.15
τc 80.55 74.06 30.08 21.27
τc∗ 80.90 74.41 30.56 21.42
τc+ 80.58 74.07 30.42 21.17
τc+∗ 80.87 74.36 30.89 21.38
τv 79.28 72.97 29.53 21.38
τv◦τc+∗ 81.01 74.51 31.02 21.57
Table 3: Parsing accuracy (MBL, ∆e); T = trans-
formation; AS = attachment score, EM = exact
match; U = unlabeled, L = labeled
AS ∆e 90.1 83.6 70.5 59.5 45.9
Length: 1 2-3 4-6 7-10 11-
∆t 51.9 29.4 11.2 4.4 3.0
τc(∆t) 54.1 29.1 10.7 3.8 2.4
τv(∆t) 52.9 29.2 10.7 4.2 2.9
Table 4: Baseline labeled AS per arc length on ∆e
(row 1); proportion of arcs per arc length in ∆t
(rows 3–5)
short arcs, but that accuracy drops quite rapidly
as the arcs get longer. This can be related to the
mean arc length in ∆t, which is 2.59 in the un-
transformed version, 2.40 in τc(∆t) and 2.54 in
τv(∆t). Rows 3-5 in table 4 show the distribution
of arcs for different arc lengths in different ver-
sions of the data set. Both τc and τv make arcs
shorter on average, which may facilitate the task
for the parser.
Another possible explanation is that learning is
facilitated if similar constructions are represented
similarly. For instance, it is probable that learning
is made more difficult when a unit has different
heads depending on whether it is part of a coordi-
nation or not.
4.3 Experiment 3: Optimization
In this section we combine the best results from
the previous section with the graph transforma-
tions proposed by Nivre and Nilsson (2005) to re-
cover non-projective dependencies. We write τp
for the projectivization of training data and τ−1p for
the inverse transformation applied to the parser’s
output.6 In addition, we replace MBL with SVM,
a learning algorithm that tends to give higher accu-
racy in classifier-based parsing although it is more
6More precisely, we use the variant called PATH in Nivre
and Nilsson (2005).
262
AS EM
T LA U L U L
None MBL 79.08 72.83 28.99 21.15
τp MBL 80.79 74.39 31.54 22.53
τp◦τv◦τc+∗ MBL 82.93 76.31 34.17 23.01
None SVM 81.09 75.68 32.24 25.02
τp SVM 82.93 77.28 35.99 27.05
τp◦τv◦τc+∗ SVM 84.55 78.82 37.63 27.69
Table 5: Optimized parsing results (SVM, ∆e); T = transformation; LA = learning algorithm; AS =
attachment score, EM = exact match; U = unlabeled, L = labeled
T P:S R:S P:C R:C P:A R:A P:M R:M
None 52.63 72.35 55.15 67.03 82.17 82.21 69.95 69.07
τp◦τv◦τc+∗ 63.73 82.10 63.20 75.14 90.89 92.79 80.02 81.40
Table 6: Detailed results for SVM; T = transformation; P = unlabeled precision, R = unlabeled recall
costly to train (Sagae and Lavie, 2005).
Table 5 shows the results, for both MBL and
SVM, of the baseline, the pure pseudo-projective
parsing, and the combination of pseudo-projective
parsing with PS-to-MS transformations. We see
that pseudo-projective parsing brings a very con-
sistent increase in accuracy of at least 1.5 percent-
age points, which is more than that reported by
Nivre and Nilsson (2005), and that the addition
of the PS-to-MS transformations increases accu-
racy with about the same margin. We also see that
SVM outperforms MBL by about two percentage
points across the board, and that the positive effect
of the graph transformations is most pronounced
for the unlabeled exact match score, where the
improvement is more than five percentage points
overall for both MBL and SVM.
Table 6 gives a more detailed analysis of the
parsing results for SVM, comparing the optimal
parser to the baseline, and considering specifically
the (unlabeled) precision and recall of the cate-
gories involved in coordination (separators S and
conjuncts C) and verb groups (auxiliary verbs A
and main verbs M). All figures indicate, with-
out exception, that the transformations result in
higher precision and recall for all directly involved
words. (All differences are significant beyond the
0.01 level.) It is worth noting that the error reduc-
tion is actually higher for A and M than for S and
C, although the former are less frequent.
With respect to unlabeled attachment score, the
results of the optimized parser are slightly below
the best published results for a single parser. Hall
and Nov´ak (2005) report a score of 85.1%, apply-
ing a corrective model to the output of Charniak’s
parser; McDonald and Pereira (2006) achieve a
score of 85.2% using a second-order spanning tree
algorithm. Using ensemble methods and a pool of
different parsers, Zeman and ˇZabokrtsk´y (2005)
attain a top score of 87.0%. For unlabeled exact
match, our results are better than any previously
reported results, including those of McDonald and
Pereira (2006). (For the labeled scores, we are not
aware of any comparable results in the literature.)
5 Conclusion
The results presented in this paper confirm that
choosing the right representation is important
in parsing. By systematically transforming the
representation of coordinate structures and verb
groups in PDT, we achieve a 10% error reduc-
tion for a data-driven dependency parser. Adding
graph transformations for non-projective depen-
dency parsing gives a total error reduction of
about 20% (even more for unlabeled exact match).
In this way, we achieve state-of-the-art accuracy
with a deterministic, classifier-based dependency
parser.
Acknowledgements
The research presented in this paper was partially
supported by the Swedish Research Council. We
are grateful to Jan Hajiˇc and Daniel Zeman for
help with the Czech data and to three anonymous
reviewers for helpful comments and suggestions.
263
References
Daniel M. Bikel. 2004. Intricacies of Collins’ parsing
model. Computational Linguistics, 30:479–511.
Alena B¨ohmov´a, Jan Hajiˇc, Eva Hajiˇcov´a, and Barbora
Hladk´a. 2003. The Prague Dependency Treebank:
A three-level annotation scenario. In Anne Abeill´e,
editor, Treebanks: Building and Using Syntactically
Annotated Corpora. Kluwer Academic Publishers.
Chih-Chung Chang and Chih-Jen Lin. 2005. LIB-
SVM: A library for support vector machines.
Eugene Charniak. 2000. A maximum-entropy-
inspired parser. In Proceedings of the First Meet-
ing of the North American Chapter of the Associa-
tion for Computational Linguistics (NAACL), pages
132–139.
Michael Collins, Jan Hajiˇc, Eric Brill, Lance Ramshaw,
and Christoph Tillmann. 1999. A statistical parser
for Czech. In Proceedings of the 37th Annual Meet-
ing of the Association for Computational Linguistics
(ACL), pages 505–512.
Michael Collins. 1997. Three generative, lexicalised
models for statistical parsing. In Proceedings of the
35th Annatual Meeting of the Association for Com-
putational Linguistics (ACL), pages 16–23.
Michael Collins. 1999. Head-Driven Statistical Mod-
els for Natural Language Parsing. Ph.D. thesis,
University of Pennsylvania.
Walter Daelemans and Antal Van den Bosch. 2005.
Memory-Based Language Processing. Cambridge
University Press.
Jason Eisner and Noah A. Smith. 2005. Parsing with
soft and hard constraints on dependency length. In
Proceedings of the 9th International Workshop on
Parsing Technologies (IWPT).
Jan Hajiˇc, Barbora Vidova Hladka, Jarmila Panevov´a,
Eva Hajiˇcov´a, Petr Sgall, and Petr Pajas. 2001.
Prague Dependency Treebank 1.0. LDC, 2001T10.
Jan Hajiˇc. 1998. Building a Syntactically Annotated
Corpus: The Prague Dependency Treebank. In Is-
sues of Valency and Meaning, pages 12–19. Prague
Karolinum, Charles University Press.
Keith Hall and Vaclav Nov´ak. 2005. Corrective mod-
eling for non-projective dependency parsing. In
Proceedings of the 9th International Workshop on
Parsing Technologies (IWPT).
Richard Hudson. 1990. English Word Grammar. Basil
Blackwell.
Mark Johnson. 1998. Pcfg models of linguistic
tree representations. Computational Linguistics,
24:613–632.
Dan Klein and Christopher Manning. 2003. Accu-
rate unlexicalized parsing. In Proceedings of the
41st Annual Meeting of the Association for Compu-
tational Linguistics (ACL), pages 423–430.
Vincenzo Lombardo and Leonardo Lesmo. 1998.
Unit coordination and gapping in dependency the-
ory. In Proceedings of the Workshop on Processing
of Dependency-Based Grammars, pages 11–20.
Ryan McDonald and Fernando Pereira. 2006. Online
learning of approximate dependency parsing algo-
rithms. In Proceedings of the 11th Conference of
the European Chapter of the Association for Com-
putational Linguistics (EACL).
Igor Mel’cuk. 1988. Dependency Syntax: Theory and
Practice. State University of New York Press.
Joakim Nivre and Johan Hall. 2005. MaltParser: A
language-independent system for data-driven depen-
dency parsing. In Proceedings of the Fourth Work-
shop on Treebanks and Linguistic Theories (TLT),
pages 137–148.
Joakim Nivre and Jens Nilsson. 2005. Pseudo-
projective dependency parsing. In Proceedings of
the 43rd Annual Meeting of the Association for
Computational Linguistics (ACL), pages 99–106.
Joakim Nivre, Johan Hall, and Jens Nilsson. 2006.
MaltParser: A data-driven parser-generator for de-
pendency parsing. In Proceedings of the 5th In-
ternational Conference on Language Resources and
Evaluation.
Joakim Nivre. 2003. An efficient algorithm for pro-
jective dependency parsing. In Proceedings of the
8th International Workshop on Parsing Technologies
(IWPT), pages 149–160.
Jong C. Park and Hyung Joon Cho. 2000. Informed
parsing for coordination with combinatory catego-
rial grammar. In Proceedings of the 18th Inter-
national Conference on Computational Linguistics
(COLING), pages 593–599.
Kenji Sagae and Alon Lavie. 2005. A classifier-based
parser with linear run-time complexity. In Proceed-
ings of the 9th International Workshop on Parsing
Technologies (IWPT), pages 125–132.
Petr Sgall, Eva Hajiˇcov´a, and Jarmila Panevov´a. 1986.
The Meaning of the Sentence in Its Pragmatic As-
pects. Reidel.
Lucien Tesni`ere. 1959. ´El´ements de syntaxe struc-
turale. Editions Klincksieck.
Daniel Zeman and Zdenˇek ˇZabokrtsk´y. 2005. Improv-
ing parsing accuracy by combining diverse depen-
dency parsers. In Proceedings of the 9th Interna-
tional Workshop on Parsing Technologies (IWPT).
Daniel Zeman. 2004. Parsing with a Statistical De-
pendency Model. Ph.D. thesis, Charles University.
264
