Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X),
pages 231–235, New York City, June 2006. ©2006 Association for Computational Linguistics
Language Independent Probabilistic Context-Free Parsing
Bolstered by Machine Learning
Michael Schiehlen Kristina Spranger
Institute for Computational Linguistics
University of Stuttgart
D-70174 Stuttgart
Michael.Schiehlen@ims.uni-stuttgart.de
Kristina.Spranger@ims.uni-stuttgart.de
Abstract
Unlexicalized probabilistic context-free parsing is a general and flexible approach that sometimes reaches competitive results in multilingual dependency parsing even if only a minimum of language-specific information is supplied. Furthermore, integrating parser results (good at long dependencies) and tagger results (good at short-range dependencies, and more easily adaptable to treebank peculiarities) gives competitive results in all languages.
1 Introduction
Unlexicalized probabilistic context-free parsing is a simple and flexible approach that has nevertheless shown good performance (Klein and Manning, 2003). We applied this approach to the shared task (Buchholz et al., 2006) for Arabic (Hajič et al., 2004), Chinese (Chen et al., 2003), Czech (Böhmová et al., 2003), Danish (Kromann, 2003), Dutch (van der Beek et al., 2002), German (Brants et al., 2002), Japanese (Kawata and Bartels, 2000), Portuguese (Afonso et al., 2002), Slovene (Džeroski et al., 2006), Spanish (Civit Torruella and Martí Antonín, 2002), Swedish (Nilsson et al., 2005), and Turkish (Oflazer et al., 2003; Atalay et al., 2003), but not Bulgarian (Simov et al., 2005). In our approach we put special emphasis on language independence: We did not use any extraneous knowledge; we did not do any transformations on the treebanks; and we restricted language-specific parameters to a small, easily manageable set (a classification of dependency relations into complements, adjuncts, and conjuncts/coordinators, plus a switch for Japanese to include coarse POS tag information, see section 3.4). In a series of post-submission experiments, we investigated how much the parse results can help a machine learner.
2 Experimental Setup
For development, we chose the initial $n$ sentences of every treebank, where $n$ is the number of sentences in the test set. In this way, the sizes were realistic for the task. For parsing the test data, we added the development set to the training set.
All the evaluations on the test sets were performed with the evaluation script supplied by the conference organizers. For development, we used labelled F-score computed from all tokens except the ones employed for punctuation (cf. section 3.2).
3 Context Free Parsing
3.1 The Parser
Basically, we investigated the performance of a straightforward unlexicalized statistical parser, viz. BitPar (Schmid, 2004). BitPar is a CKY parser that uses bit vectors for an efficient representation of the chart and its items. If frequencies for the grammatical and lexical rules in a training set are available, BitPar uses the Viterbi algorithm to extract the most probable parse tree (according to the PCFG) from the chart.
3.2 Converting Dependency Structure to Constituency Structure
In order to determine the grammar rules required by the context-free parser, the dependency trees in the CoNLL format have to be converted to constituency trees. Gaifman (1965) proved that projective dependency grammars can be mapped to context-free grammars. The main information that needs to be added in going from dependency to constituency structure is the category of non-terminals. The use of special knowledge bases to determine projections of categories (Xia and Palmer, 2001) would have presupposed language-dependent knowledge, so we investigated two other options: flat rules (Collins et al., 1999) and binary rules. In the flat rules approach, each lexical category projects to exactly one phrasal category, and every projection chain has a length of at most one. The binary rules approach makes use of the X-bar scheme and thus introduces, along with the phrasal category, an intermediate category. The phrasal category must not occur more than once in a projection chain, and a projection chain must not end in an intermediate category. In both approaches, projection is only triggered if dependents are present; in case a category occurs as a dependent itself, no projection is required. In coordination structures, the parent category is copied from that of the last conjunct.
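To make the construction concrete, here is a minimal Python sketch of the flat-rules conversion for projective trees (the token layout and the derived phrasal label `cat + "P"` are our own illustrative choices; coordination and the binary X-bar variant are omitted):

```python
from collections import defaultdict

def to_flat_tree(tokens):
    """tokens: dicts with integer 'id' (1-based), 'head' (0 = root), 'cat'."""
    deps = defaultdict(list)
    for t in tokens:
        deps[t["head"]].append(t)

    def build(tok):
        ds = deps[tok["id"]]
        if not ds:                       # category occurring as dependent only:
            return (tok["cat"], [])      # no projection required
        # flat rule: the head and all its dependents are sisters under
        # a single phrasal node; the projection chain has length one
        parts = sorted(ds + [tok], key=lambda t: t["id"])
        kids = [(t["cat"], []) if t is tok else build(t) for t in parts]
        return (tok["cat"] + "P", kids)

    return build(deps[0][0])             # assumes a unique root dependent
```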
Non-projective relations can be treated as unbounded dependencies, so that the dependent's surface position (antecedent position) is related to the position of its head (trace position) with an explicit co-indexed trace (as in the Penn treebank). To find the positions of trace and antecedent we assume three constraints: the antecedent should c-command its trace; the antecedent is maximally near to the trace in depth of embedding; and the trace is maximally near to the antecedent in surface order.
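Read as a preference order, these constraints amount to a lexicographic minimization over candidate (antecedent, trace) position pairs. The following sketch is purely illustrative; the candidate enumeration and the helper functions are assumed to be supplied by the tree representation:

```python
def choose_placement(candidates, c_commands, depth, surface_pos):
    """candidates: iterable of (antecedent, trace) position pairs."""
    # hard constraint: the antecedent must c-command its trace
    legal = [(a, t) for (a, t) in candidates if c_commands(a, t)]
    # soft constraints, in order: minimal difference in embedding depth,
    # then minimal surface distance between antecedent and trace
    return min(legal, key=lambda p: (depth(p[1]) - depth(p[0]),
                                     abs(surface_pos(p[0]) - surface_pos(p[1]))))
```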
Finally, the placement of punctuation signs has a major impact on the performance of a parser (Collins et al., 1999). In most of the treebanks, not much effort is invested into the treatment of punctuation. Sometimes punctuation signs play a role in predicate-argument structure (commas acting as coordinators), but more often they do not, in which case they are marked by special roles (e.g. “pnct”, “punct”, “PUNC”, or “PUNCT”). We used a general mechanism to re-insert such signs for all languages but CH (no punctuation signs) and AR, CZ, SL (reliable annotation). Correct placement of punctuation presupposes knowledge of the punctuation rules valid in a language. In the interest of generality, we opted for a suboptimal solution: punctuation signs are inserted at the highest possible position in the tree.
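A sketch of this heuristic (the span-annotated node type is our own simplification):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    start: int                      # leftmost covered string position
    end: int                        # rightmost covered string position
    children: List["Node"] = field(default_factory=list)

def insert_punct(root: Node, pos: int) -> None:
    # descend only when the sign falls strictly inside a child's span;
    # otherwise attach it here, i.e. at the highest possible position
    for child in root.children:
        if child.start < pos < child.end:
            return insert_punct(child, pos)
    root.children.append(Node(pos, pos))
```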
3.3 Subcategorization and Coordination
The most important language-specific information that we made use of was a classification of dependency relations into complements, coordinators/conjuncts, and other relations (adjuncts).
Given knowledge about complement relations, it is fairly easy to construct subcategorization frames for word occurrences: A subcategorization frame is simply the set of the complement relations by which dependents are attached to the word. To give the parser access to these frames, we annotated the category of a subcategorizing word with its subcategorization frame. In this way, the parser can learn to associate the subcategorization requirements of a word with its local syntactic context (Schiehlen, 2004).
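A small sketch of the frame annotation (the relation names are placeholders; the actual complement classes are the language-specific parameter mentioned above):

```python
COMPLEMENT_RELS = {"SBJ", "OBJ", "IOBJ"}   # illustrative complement classification

def annotate_subcat(word_cat, dep_rels):
    """Append the subcat frame (set of complement relations) to a category."""
    frame = sorted({r for r in dep_rels if r in COMPLEMENT_RELS})
    return word_cat + ("<" + ",".join(frame) + ">" if frame else "")

# e.g. annotate_subcat("V", ["SBJ", "OBJ", "MOD"]) == "V<OBJ,SBJ>"
```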
Coordination constructions are marked either on the conjuncts (CH, CZ, DA, DU, GE, PO, SW) or on the coordinator (AR, SL). If conjuncts mark coordination, a common representation of asyndetic coordination has one conjunct point to another conjunct. It is therefore important to distinguish coordinators from conjuncts. Coordinators are singled out either by special dependency relations (DA, PO, SW) or by their POS tags (CH, DU). In German, the first conjunct phrase is merged with the whole coordinated phrase (due to a conversion error?), so that determining the coordinator as a head is not possible.
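In code, the distinction reduces to a per-language test along these lines (both symbol sets are placeholders for the treebank-specific inventories):

```python
COORD_RELS = {"coord"}      # treebanks marking coordinators by dependency relation
COORD_POS = {"Conj"}        # treebanks marking coordinators by POS tag

def is_coordinator(token):
    return token["deprel"] in COORD_RELS or token["pos"] in COORD_POS
```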
We also experimented with attaching the POS tags of heads to the categories of their adjunct dependents. In this way, the parser could differentiate between e.g. verbal and nominal adjuncts. In our experiments, the performance gains achieved by this strategy were low, so we did not incorporate it into the system. Possibly, better results could be achieved by restricting the annotation to special classes of adjuncts or by generalizing the heads' POS tags.
3.4 Categories
As the treebanks provide a lot of information with every word token, it is a delicate question to decide on the type and granularity of the information to use in the categories of the grammar.
Ch Da Du Ge Ja Po Sp Tu
coarse POS 72.99 69.38 69.27 – 79.07 66.09
fine POS 61.21 69.78 67.72 7.40 73.44 71.75 54.96
POS + feat – 42.67 40.40 –
dep-rel 76.61 72.77 70.70 70.31 78.12 72.93 66.93 65.03
coarse + dep-rel 77.61 67.56 69.43 – 81.36 64.03
fine + dep-rel 51.21 57.72 68.55 46.28 36.59 54.97
Figure 1: Types of Categories (Development Results)
The treebanks specify for every word a (fine-grained) POS tag, a coarse-grained POS tag, a collection of morphosyntactic features, and a dependency relation (dep-rel). Only the dependency relation is really orthogonal; the other slots contain various generalizations of the same morphological information. We tested several options: coarse-grained POS tag (if available), fine-grained POS tag, fine-grained POS tag with morphosyntactic features (if available), name of the dependency relation, and the combinations of coarse-grained or fine-grained POS tags with the dependency relation.
Figure 1 shows F-score results on the development set for several languages and different combinations. The best overall performer is dep-rel; this somewhat astonishing fact may be due to the superior quality of the annotations in this slot (dependency relations were annotated by hand, POS tags automatically). Furthermore, being checked in the evaluation, dependency relations directly affect performance. Since we wanted a general language-independent strategy, we always used the dep-rel tags, except for Japanese: the Japanese treebank features only 8 different dependency relations, so we added coarse-grained POS tag information. In the categories for Czech, we deleted the suffixes marking coordination, apposition and parenthesis (Co, Ap, Pa), reducing the number of categories roughly by a factor of four. In coordination, conjuncts inherit the dep-rel category from the parent.
Whereas the dep-rel information is submitted to the parser directly in terms of the categories, the information in the lemma, POS tag and morphosyntactic features slots was used only for back-off smoothing when associating lexical items with categories.
Cz Ge Sp Sw
dep-rel 52.66 70.31 66.93 72.91
new classification 58.92 74.32 66.09 61.59
new + dep-rel 56.94 78.40 64.03 66.32
Figure 4: Manual POS Tag Classes (Development)
A grammar with this configuration was used to produce the submitted results (cf. the line labelled CF in Figures 2 and 3).
Instead of using the category generalizations supplied with the treebanks directly, manual labour can be put into discovering classifications that behave better for the purposes of statistical parsing. Thus, Collins et al. (1999) proposed a tag classification for parsing the Czech treebank. We also investigated a classification for German1, as well as one for Swedish and one for Spanish, which were modelled after the German classification. The results in Figure 4 show that new classifications may have a dramatic effect on performance if the treebank is sufficiently large. In the interest of generality, we did not make use of the language-dependent tag classifications for the submitted results, but we nevertheless report results that could have been achieved with these classifications.
3.5 Markovization
Another strategy that is often used in statistical parsing is Markovization (Collins, 1999): Treebanks
1 punctuation {$( $” $, $.}, adjectives {ADJA ADJD CARD}, adverbs {ADV PROAV PTKA PTKNEG PTKVZ PWAV}, prepositions {APPR APPO APZR APPRART KOKOM}, nouns {NN NE NNE PDS PIS PPER PPOSS PRELS PRF PWS SYM}, determiners {ART PDAT PIAT PRELAT PPOSAT PWAT}, verb forms {VAFIN VMFIN VVFIN} {VAIMP VVIMP} {VAINF VMINF VVINF} {VAPP VMPP VVPP} {VVIZU PTKZU}, clause-like items {ITJ PTKANT KOUS}
Ar Ch Cz Da Du Ge Ja Po Sl Sp Sw Tu Bu
Best 66.91 89.96 80.18 84.79 79.19 87.34 91.65 87.60 73.44 82.25 84.58 65.68 87.57
Average 59.94 78.32 67.17 76.16 70.73 78.58 85.86 80.63 65.16 73.52 76.44 55.95 79.98
CF (submitted) 44.39 66.20 53.34 76.05 72.11 68.73 83.35 71.01 50.72 46.96 71.10 49.81 –
MaxEnt 59.16 61.65 63.28 73.25 64.47 73.94 82.79 80.30 66.27 69.73 72.99 47.16 –
combined 61.82 73.34 71.74 79.64 75.51 80.75 88.15 82.43 67.09 71.15 76.88 53.65 –
CF+Markov 45.37 70.76 55.14 74.49 72.55 68.87 84.57 71.89 55.16 47.95 71.18 51.64 –
CFM+newcl 73.84 62.10 77.76 49.61 –
combined 76.84 72.76 82.59 69.38 72.57 –
new rules (in %) 7.15 6.03 4.64 7.34 5.03 7.42 5.59 6.69 21.00 9.50 10.14 14.23
Figure 2: Labelled Accuracy Results on the Test Sets
Ar Ch Cz Da Du Ge Ja Po Sl Sp Sw Tu
CF 41.91 76.61 52.66 72.77 70.69 70.31 81.36 72.76 49.00 66.93 72.91 65.03
CF+Markov 63.00 80.25 52.80 73.31 70.70 70.51 82.59 74.37 52.43 67.81 73.56 82.80
CFM+newcl 83.07 59.03 80.42 69.30
Figure 3: F Score Results on the Development Sets
usually contain very many long rules of low frequency (presumably because inserting nodes costs annotators time). Such rules cannot have an impact in a statistical system (the line new rules in Figure 2 shows the percentage of rules in the test set that are not in the training set); it is better to view them as products of a Markov process that chooses first the head, then the symbols left of the head, and finally the symbols right of the head. In a bigram model, the choice of left and right siblings is made dependent not only on the parent and head category, but also on the last sibling on the left or right, respectively. Formally, the probability of a rule with left-hand side $X$ and right-hand side $L_m \dots L_1\, H\, R_1 \dots R_n$ is broken down into the product of the probability $p_h(H \mid X)$ of the head, the probabilities $p_l(L_i \mid L_{i-1}, H, X)$ of the left siblings, and those $p_r(R_i \mid R_{i-1}, H, X)$ of the right siblings. Generic symbols designate the beginning ($L_0$, $R_0$) and end ($L_{m+1}$, $R_{n+1}$) of the sibling lists. The method can be transferred to plain unlexicalized PCFG (Klein and Manning, 2003) by transforming long rules into a series of binary rules:

  $X \rightarrow L_m\ \langle X,H,L_m,L_{m-1}\rangle$
  $\langle X,H,L_{i+1},L_i\rangle \rightarrow L_i\ \langle X,H,L_i,L_{i-1}\rangle$
  $\langle X,H,L_1,L_0\rangle \rightarrow [X,H,R_n,R_{n-1}]\ R_n$
  $[X,H,R_{i+1},R_i] \rightarrow [X,H,R_i,R_{i-1}]\ R_i$
  $[X,H,R_1,R_0] \rightarrow H$

If the bigram symbols $[X,H,R_i,R_{i-1}]$ and $\langle X,H,L_i,L_{i-1}\rangle$ occur in fewer than a certain number of rules (50 in our case), we smooth to unigram symbols instead ($[X,H,R_i]$ and $\langle X,H,L_i\rangle$). We used a script of Schmid (2006) to Markovize infrequent rules in this manner (i.e., all rules with fewer than 50 occurrences that are not coordination rules).
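The following Python sketch is our reconstruction of this binarization (not Schmid's actual script); the frequency-based back-off from bigram to unigram symbols is omitted:

```python
def binarize(x, left, head, right):
    """x -> L_m .. L_1 head R_1 .. R_n; left = [L_m,...,L_1], right = [R_1,...,R_n]."""
    L = ["<B>"] + left[::-1]              # L[i] is L_i; L[0] marks the list beginning
    R = ["<B>"] + right                   # R[i] is R_i
    ang = lambda i: f"<{x},{head},{L[i]},{L[i-1]}>"
    sq = lambda i: f"[{x},{head},{R[i]},{R[i-1]}]"
    rules, parent = [], x
    for i in range(len(left), 0, -1):     # peel off left siblings outside-in
        rules.append((parent, [L[i], ang(i)]))
        parent = ang(i)
    for i in range(len(right), 0, -1):    # then right siblings outside-in
        rules.append((parent, [sq(i), R[i]]))
        parent = sq(i)
    rules.append((parent, [head]))        # [X,H,R_1,R_0] -> H
    return rules
```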
For time reasons, Markovization was not taken into account in the submitted results. We refer to Figures 2 and 3 (line labelled CF+Markov) for a listing of the results attainable by Markovization on the individual treebanks. Performance gains are even more dramatic if, in addition, dependency relations plus manual POS tag classes are used as categories (line labelled CFM+newcl in Figures 2 and 3).
3.6 From Constituency Structure Back to Dependency Structure
In a last step, we converted the constituent trees back to dependency trees, using the algorithm of Gaifman (1965). Special provisos were necessary for the root node, for which no head is given in certain treebanks (Džeroski et al., 2006). To interpret the context-free rules, we associated their children with dependency relations. This information was kept in a separate file that was invisible to the parser. In case there were several possible interpretations for a context-free rule, we always chose the one most frequent in the training data (Schiehlen, 2004).
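A sketch of this back-conversion step (the tree encoding and the layout of the interpretation table are our own; `interp` plays the role of the separate file mentioned above):

```python
def to_dependencies(node, interp, arcs):
    """node = (label, children) or (label, word_index) for terminals.
    Returns the lexical head index of 'node' and appends (dep, head, rel)
    triples to 'arcs'."""
    label, rest = node
    if isinstance(rest, int):                 # terminal: a word position
        return rest
    sig = (label, tuple(c[0] for c in rest))  # context-free rule signature
    head_pos, rels = interp[sig]              # most frequent interpretation
    head = to_dependencies(rest[head_pos], interp, arcs)
    others = [c for i, c in enumerate(rest) if i != head_pos]
    for child, rel in zip(others, rels):
        arcs.append((to_dependencies(child, interp, arcs), head, rel))
    return head
```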
4 Machine Learning
While the results coming from the statistical parser are not really competitive, we believe that they nevertheless present valuable information for a machine learner. To give some substance to this claim, we undertook experiments with Zhang Le's MaxEnt Toolkit2. For this work, we recast the dependency parsing problem as a classification problem: Given some feature information on a word token, in which dependency relation does it stand to which head? While the representation of dependency relations is straightforward, the representation of heads is more difficult. Building on past experiments (Schiehlen, 2003), we chose the “nth-tag” representation, which consists of three pieces of information: the POS tag of the head, the direction in which the head lies (left or right), and the number of words with the same POS tag between head and dependent. We used the following features to describe a word token: the fine-grained POS tag, the lemma (or full form) if it occurs at least 10 times, the morphosyntactic features, and the POS tags of the four preceding and the four following word tokens. The learner was trained in the standard configuration (30 iterations). The results for this method on the test data are shown in Figure 2 (line MaxEnt).
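The head encoding can be sketched as follows (our rendering; the field layout is illustrative):

```python
def nth_tag(tokens, dep):
    """tokens: list of (pos, head, deprel) with 0-based indices; the root's
    head would need special treatment and is ignored here."""
    pos_tags = [t[0] for t in tokens]
    _, head, rel = tokens[dep]
    direction = "R" if head > dep else "L"
    lo, hi = sorted((dep, head))
    # count intervening words carrying the head's POS tag
    n = sum(1 for p in pos_tags[lo + 1:hi] if p == pos_tags[head])
    return f"{rel}|{pos_tags[head]}|{direction}|{n}"
```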
In a second experiment we added parsing results (obtained by 10-fold cross-validation on the training set) in two features: the proposed dependency relation and the proposed head. Results of the extended learning approach are shown in Figure 2 (line combined).
5 Conclusion
We have presented a general approach to parsing arbitrary languages based on dependency treebanks that uses a minimum overhead of language-specific information and nevertheless supplies competitive results in some languages (Da, Du). Even better results can be reached if POS tag classifications that are optimized for specific languages are used in the categories (Ge). Markovization usually brings an improvement of up to 2%; a higher gain is reached in Slovene (where many new rules occur in the test set) and Chinese (which has the highest number of dependency relations). Comparable results in the literature are Schiehlen's (2004) 81.03% dependency F-score reached on the German NEGRA treebank and Collins et al.'s (1999) 80.0% labelled accuracy on the Czech PDT treebank. Collins (1999) used a lexicalized approach, Schiehlen (2004) used the manually annotated phrasal categories of the treebank.

2 http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
Our second result is that context-free parsing can also boost the performance of a simple tagger-like machine learning system. While a maximum-entropy learner on its own achieves competitive results for only three languages (Ar, Po, Sl), competitive results in basically all languages are produced with access to the results of the probabilistic parser.
Thanks go to Helmut Schmid for providing support with his parser and the Markovization script.

References

S. Buchholz, E. Marsi, A. Dubey, and Y. Krymolowski. 2006. CoNLL-X shared task on multilingual dependency parsing. In CoNLL-X. SIGNLL.

Michael J. Collins, Jan Hajič, Lance Ramshaw, and Christoph Tillmann. 1999. A Statistical Parser for Czech. In ACL '99, College Park, MD.

Michael J. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, Univ. of Pennsylvania.

Haim Gaifman. 1965. Dependency Systems and Phrase-Structure Systems. Information and Control, 8(3):304–337.

Dan Klein and Christopher Manning. 2003. Accurate Unlexicalized Parsing. In ACL '03, pages 423–430.

Michael Schiehlen. 2003. Combining Deep and Shallow Approaches in Parsing German. In ACL '03, pages 112–119, Sapporo, Japan.

Michael Schiehlen. 2004. Annotation Strategies for Probabilistic Parsing in German. In COLING '04, pages 390–396, Geneva, Switzerland, August.

Helmut Schmid. 2004. Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In COLING '04, Geneva, Switzerland.

Helmut Schmid. 2006. Trace Prediction and Recovery with Unlexicalized PCFGs and Gap Threading. Submitted to COLING '06.

Fei Xia and Martha Palmer. 2001. Converting dependency structures to phrase structures. In HLT 2001.
