Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 505–512,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Creating a CCGbank and a wide-coverage CCG lexicon for German
Julia Hockenmaier
Institute for Research in Cognitive Science
University of Pennsylvania
Philadelphia, PA 19104, USA
juliahr@cis.upenn.edu
Abstract
We present an algorithm which creates a
German CCGbank by translating the syn-
tax graphs in the German Tiger corpus into
CCG derivation trees. The resulting cor-
pus contains 46,628 derivations, covering
95% of all complete sentences in Tiger.
Lexicons extracted from this corpus con-
tain correct lexical entries for 94% of all
known tokens in unseen text.
1 Introduction
A number of wide-coverage TAG, CCG, LFG and
HPSG grammars (Xia, 1999; Chen et al., 2005;
Hockenmaier and Steedman, 2002a; O’Donovan
et al., 2005; Miyao et al., 2004) have been ex-
tracted from the Penn Treebank (Marcus et al.,
1993), and have enabled the creation of wide-
coverage parsers for English which recover local
and non-local dependencies that approximate the
underlying predicate-argument structure (Hocken-
maier and Steedman, 2002b; Clark and Curran,
2004; Miyao and Tsujii, 2005; Shen and Joshi,
2005). However, many corpora (Böhmová et al.,
2003; Skut et al., 1997; Brants et al., 2002) use
dependency graphs or other representations, and
the extraction algorithms that have been developed
for Penn Treebank style corpora may not be im-
mediately applicable to this representation. As a
consequence, research on statistical parsing with
“deep” grammars has largely been confined to En-
glish. Free-word order languages typically pose
greater challenges for syntactic theories (Rambow,
1994), and the richer inflectional morphology of
these languages creates additional problems both
for the coverage of lexicalized formalisms such
as CCG or TAG, and for the usefulness of de-
pendency counts extracted from the training data.
On the other hand, formalisms such as CCG and
TAG are particularly suited to capture the cross-
ing dependencies that arise in languages such as
Dutch or German, and by choosing an appropriate
linguistic representation, some of these problems
may be mitigated.
Here, we present an algorithm which translates
the German Tiger corpus (Brants et al., 2002) into
CCG derivations. Similar algorithms have been
developed by Hockenmaier and Steedman (2002a)
to create CCGbank, a corpus of CCG derivations
(Hockenmaier and Steedman, 2005) from the Penn
Treebank, by Çakıcı (2005) to extract a CCG lex-
icon from a Turkish dependency corpus, and by
Moortgat and Moot (2002) to induce a type-logical
grammar for Dutch.
The annotation scheme used in Tiger is an ex-
tension of that used in the earlier, and smaller,
German Negra corpus (Skut et al., 1997). Tiger
is better suited for the extraction of subcatego-
rization information (and thus the translation into
“deep” grammars of any kind), since it distin-
guishes between PP complements and modifiers,
and includes “secondary” edges to indicate shared
arguments in coordinate constructions. Tiger also
includes morphology and lemma information.
Negra is also provided with a “Penn Treebank”-
style representation, which uses flat phrase struc-
ture trees instead of the crossing dependency
structures in the original corpus. This version
has been used by Cahill et al. (2005) to extract a
German LFG. However, Dubey and Keller (2003)
have demonstrated that lexicalization does not
help a Collins-style parser that is trained on this
corpus, and Levy and Manning (2004) have shown
that its context-free representation is a poor ap-
proximation to the underlying dependency struc-
ture. The resource presented here will enable
future research to address the question whether
“deep” grammars such as CCG, which capture the
underlying dependencies directly, are better suited
to parsing German than linguistically inadequate
context-free approximations.
[Figure 1 (summary of the derivations): all three derivations assign gibt the same lexical category ((S[v1]/NP[a])/NP[d])/NP[n].
1. Standard main clause: Peter gibt Maria das Buch. The subject is topicalized, Peter := S[dcl]/(S[v1]/NP[n]); Maria and das Buch are type-raised to (S[v1]/NP[a])\((S[v1]/NP[a])/NP[d]) and S[v1]\(S[v1]/NP[a]) and composed with the verb (<B), yielding S[v1]/NP[n], to which the subject applies (>) to give S[dcl].
2. Main clause with fronted adjunct: dann gibt Peter Maria das Buch. The adjunct dann := S/S becomes S[dcl]/S[v1] by a type-changing rule; the verb applies (>) to Peter, Maria and das Buch in turn to yield S[v1], and dann applies to give S[dcl].
3. Main clause with fronted complement: Maria gibt Peter das Buch. The fronted object is type-raised, Maria := S[dcl]/(S[v1]/NP[d]); the verb applies (>) to Peter, composes (<B) with the type-raised das Buch to yield S[v1]/NP[d], and Maria applies to give S[dcl].]
Figure 1: CCG uses topicalization (1.), a type-changing rule (2.), and type-raising (3.) to capture the
different variants of German main clause order with the same lexical category for the verb.
2 German syntax and morphology
Morphology German verbs are inflected for
person, number, tense and mood. German nouns
and adjectives are inflected for number, case and
gender, and noun compounding is very productive.
Word order German has three different word
orders that depend on the clause type. Main
clauses (1) are verb-second. Imperatives and ques-
tions are verb-initial (2). If a modifier or one of
the objects is moved to the front, the subject
follows the verb (1b, 1c). Subordinate and relative
clauses are verb-final (3):
(1) a. Peter gibt Maria das Buch.
Peter gives Mary the book.
b. ein Buch gibt Peter Maria.
c. dann gibt Peter Maria das Buch.
(2) a. Gibt Peter Maria das Buch?
b. Gib Maria das Buch!
(3) a. dass Peter Maria das Buch gibt.
b. das Buch, das Peter Maria gibt.
Local Scrambling In the so-called “Mittelfeld”
all orders of arguments and adjuncts are poten-
tially possible. In the following example, all 5!
permutations are grammatical (Rambow, 1994):
(4) dass [eine Firma] [meinem Onkel] [die Möbel] [vor
drei Tagen] [ohne Voranmeldung] zugestellt hat.
that [a company] [to my uncle] [the furniture] [three
days ago] [without notice] delivered has.
Long-distance scrambling Objects of embed-
ded verbs can also be extraposed unboundedly
within the same sentence (Rambow, 1994):
(5) dass [den Schrank] [niemand] [zu reparieren] ver-
sprochen hat.
that [the wardrobe] [nobody] [to repair] promised
has.
3 A CCG for German
3.1 Combinatory Categorial Grammar
CCG (Steedman (1996; 2000)) is a lexicalized
grammar formalism with a completely transparent
syntax-semantics interface. Since CCG is mildly
context-sensitive, it can capture the crossing de-
pendencies that arise in Dutch or German, yet is
efficiently parseable.
In categorial grammar, words are associ-
ated with syntactic categories, such as S\NP or
(S\NP)/NP for English intransitive and transitive
verbs. Categories of the form X/Y or X\Y are func-
tors, which take an argument Y to their left or right
(depending on the direction of the slash) and
yield a result X. Every syntactic category is paired
with a semantic interpretation (usually a λ-term).
Like all variants of categorial grammar, CCG
uses function application to combine constituents,
but it also uses a set of combinatory rules such as
composition (B) and type-raising (T). Non-order-
preserving type-raising is used for topicalization:
Application:    X/Y  Y   ⇒  X
                Y  X\Y   ⇒  X
Composition:    X/Y  Y/Z  ⇒B  X/Z
                X/Y  Y\Z  ⇒B  X\Z
                Y\Z  X\Y  ⇒B  X\Z
                Y/Z  X\Y  ⇒B  X/Z
Type-raising:   X  ⇒T  T\(T/X)
Topicalization: X  ⇒T  T/(T/X)
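To make these rule schemata concrete, here is a minimal Python sketch (not part of the original paper; all class and function names are illustrative) of categories and the application, composition, and type-raising rules above:

class Cat:
    # Atomic category, e.g. S[dcl] or NP[nom].
    def __init__(self, atom, feature=None):
        self.atom, self.feature = atom, feature
    def __eq__(self, other):
        return (isinstance(other, Cat) and self.atom == other.atom
                and self.feature == other.feature)
    def __repr__(self):
        return self.atom + ("[%s]" % self.feature if self.feature else "")

class Functor:
    # Complex category X/Y (slash '/') or X\Y (slash '\\').
    def __init__(self, result, slash, arg):
        self.result, self.slash, self.arg = result, slash, arg
    def __eq__(self, other):
        return (isinstance(other, Functor) and self.slash == other.slash
                and self.result == other.result and self.arg == other.arg)
    def __repr__(self):
        return "(%s%s%s)" % (self.result, self.slash, self.arg)

def apply_(fn, arg, slash):
    # X/Y Y => X (slash '/');  Y X\Y => X (slash '\\')
    if isinstance(fn, Functor) and fn.slash == slash and fn.arg == arg:
        return fn.result
    return None

def compose(x, y):
    # Forward composition X/Y Y/Z =>B X/Z; the other three
    # directions in the schema above are analogous.
    if (isinstance(x, Functor) and isinstance(y, Functor)
            and x.slash == '/' and y.slash == '/' and x.arg == y.result):
        return Functor(x.result, '/', y.arg)
    return None

def type_raise(x, t, topicalized=False):
    # X =>T T\(T/X); topicalization is the non-order-preserving T/(T/X)
    return Functor(t, '/' if topicalized else '\\', Functor(t, '/', x))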
Hockenmaier and Steedman (2005) advocate
the use of additional “type-changing” rules to deal
with complex adjunct categories (e.g. S[ng]\NP ⇒
NP\NP for ing-VPs that act as noun phrase mod-
ifiers). Here, we also use a small number of such
rules to deal with similar adjunct cases.
3.2 Capturing German word order
We follow Steedman (2000) in assuming that the
underlying word order in main clauses is always
verb-initial, and that the sentence-initial subject is
in fact topicalized. This enables us to capture dif-
ferent word orders with the same lexical category
(Figure 1). We use the features S[v1] and S[vlast] to
distinguish verbs in main and subordinate clauses.
Main clauses have the feature S[dcl], requiring ei-
ther a sentential modifier with category S[dcl]/S[v1],
a topicalized subject (S[dcl]/(S[v1]/NP[nom])), or a
type-raised argument (S[dcl]/(S[v1]\X)), where X
can be any argument category, such as a noun
phrase, prepositional phrase, or a non-finite VP.
Here is the CCG derivation for the subordinate
clause (S[emb]) example:

dass := S[emb]/S[vlast], Peter := NP[n], Maria := NP[d], das Buch := NP[a],
gibt := ((S[vlast]\NP[n])\NP[d])\NP[a]

The verb applies backward (<) to das Buch, Maria, and Peter in turn, yielding (S[vlast]\NP[n])\NP[d], then S[vlast]\NP[n], then S[vlast]; finally dass applies forward (>) to give S[emb].
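As a usage example, this verb-final derivation can be replayed with the Cat/Functor sketch given after the rule schemata in Section 3.1 (again illustrative, using the short case features [n], [d], [a]):

S_vlast, S_emb = Cat("S", "vlast"), Cat("S", "emb")
NPn, NPd, NPa = Cat("NP", "n"), Cat("NP", "d"), Cat("NP", "a")

gibt = Functor(Functor(Functor(S_vlast, '\\', NPn), '\\', NPd), '\\', NPa)
dass = Functor(S_emb, '/', S_vlast)

x = apply_(gibt, NPa, '\\')    # das Buch:  (S[vlast]\NP[n])\NP[d]
x = apply_(x, NPd, '\\')       # Maria:     S[vlast]\NP[n]
x = apply_(x, NPn, '\\')       # Peter:     S[vlast]
print(apply_(dass, x, '/'))    # dass:      S[emb]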
For simplicity’s sake, our extraction algorithm
ignores the issues that arise through local scram-
bling, and assumes that there is a different lexical
category for each permutation. (Variants of CCG,
such as Set-CCG (Hoffman, 1995) and Multimodal-CCG
(Baldridge, 2002), allow a more compact lexicon
for free word order languages.)
Type-raising and composition are also used to
deal with wh-extraction and with long-distance
scrambling (Figure 2).
4 Translating Tiger graphs into CCG
4.1 The Tiger corpus
The Tiger corpus (Brants et al., 2002) is a pub-
licly available corpus
(http://www.ims.uni-stuttgart.de/projekte/TIGER)
of ca. 50,000 sentences (al-
most 900,000 tokens) taken from the Frankfurter
Rundschau newspaper. The annotation is based
on a hybrid framework which contains features of
phrase-structure and dependency grammar. Each
sentence is represented as a graph whose nodes
are labeled with syntactic categories (NP, VP, S,
PP, etc.) and POS tags. Edges are directed and la-
beled with syntactic functions (e.g. head, subject,
accusative object, conjunct, appositive). The edge
labels are similar to the Penn Treebank function
tags, but provide richer and more explicit infor-
mation. Only 72.5% of the graphs have no cross-
ing edges; the remaining 27.5% are marked as dis-
continuous. 7.3% of the sentences have one or
more “secondary” edges, which are used to indi-
cate double dependencies that arise in coordinated
structures which are difficult to bracket, such as
right node raising, argument cluster coordination
or gapping. There are no traces or null elements to
indicate non-local dependencies or wh-movement.
Figure 2 shows the Tiger graph for a PP whose
NP argument is modified by a relative clause.
There is no NP level inside PPs (and no noun level
inside NPs). Punctuation marks are often attached
at the so-called “virtual” root (VROOT) of the en-
tire graph. The relative pronoun is a dative object
(edge label DA) of the embedded infinitive, and
is therefore attached at the VP level. The relative
clause itself has the category S; the incoming edge
is labeled RC (relative clause).
4.2 The translation algorithm
Our translation algorithm has the following steps:
translate(TigerGraph g):
    TigerTree t = createTree(g);
    preprocess(t);
    if (t ≠ null)
        CCGderiv d = translateToCCG(t);
        if (d ≠ null)
            if (isCCGderivation(d))
                return d;
            else fail;
        else fail;
    else fail;
1. Creating a planar tree: After an initial pre-
processing step which inserts punctuation that is
attached to the “virtual” root (VROOT) of the
graph in the appropriate locations, discontinuous
graphs are transformed into planar trees. Starting
at the lowest nonterminal nodes, this step turns
the Tiger graph into a planar tree without cross-
ing edges, where every node spans a contiguous
substring. This is required as input to the actual
translation step, since CCG derivations are pla-
nar binary trees. If the first to the i-th child of a
node X span a contiguous substring that ends in
the j-th word, and the (i+1)-th child spans a sub-
string starting at k > j+1, we attempt to move
the first i children of X to its parent P (if the
head position of P is greater than i). Punctuation
marks and adjuncts are simply moved up the tree
and treated as if they were originally attached to
P. This changes the syntactic scope of adjuncts,
but typically only VP modifiers are affected which
could also be attached at a higher VP or S node
without a change in meaning. The main exception
[1. The original Tiger graph: the example is the PP an einem Höchsten, dem der kleine Mensch sich fraglos zu unterwerfen habe (roughly: 'to a Highest, to whom the small human must submit without question'). The relative pronoun dem is a dative object (DA) of the embedded infinitive and is attached at the VP level; the relative clause has category S, with its incoming edge labeled RC; the comma is attached at the virtual root.
2. After transformation into a planar tree and preprocessing:
(PP (APPR-AC an) (NP-ARG (ART-HD einem) (NOUN-ARG (NN-NK Höchsten)) (PKT ,) (SBAR-RC (PRELS-EXTRA-DA dem) (S-ARG (NP-SB (ART-NK der) (NOUN-ARG (ADJA-NK kleine) (NN-HD Mensch))) (VP-OC (PRF-ADJ sich) (ADJD-MO fraglos) (VZ-HD (PTKZU-PM zu) (VVINF unterwerfen))) (VAFIN-HD habe)))))
3. The resulting CCG derivation assigns: an := PP/NP[dat], einem := NP[dat]/N[dat], dem := (NP\NP)/(S[vlast]\NP[dat]), der := NP[nom]/N[nom], kleine := N/N, sich and fraglos := (S\NP)/(S\NP), zu := (S[z]\NP)/(S[b]\NP), unterwerfen := (S[b]\NP)\NP[dat], habe := (S[vlast]\NP[nom])\(S[z]\NP). The dative trace is percolated through the verb cluster, the subject is type-raised to S[vlast]/(S[vlast]\NP[nom]), and the relative clause derives S[vlast]\NP[dat], so that dem yields NP\NP.]
Figure 2: From Tiger graphs to CCG derivations
are extraposed relative clauses, which CCG treats
as sentential modifiers with an anaphoric depen-
dency. Arguments that are moved up are marked
as extracted, and an additional “extraction” edge
(explained below) from the original head is intro-
duced to capture the correct dependencies in the
CCG derivation. Discontinuous dependencies be-
tween resumptive pronouns (“place holders”, PH)
and their antecedents (“repeated elements”, RE)
are also dissolved.
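As a rough illustration of this step, the following Python sketch (assumed Node objects with a children list and a span of word indices; the head-position test and the "extraction" edges recorded for moved arguments are omitted) moves the leftmost children of a discontinuous node up to its parent:

def start(n): return min(n.span)
def end(n): return max(n.span)

def planarize(node, parent=None):
    # Bottom-up: repair the lowest nonterminal nodes first.
    for child in node.children:
        planarize(child, node)
    changed = parent is not None
    while changed:
        changed = False
        for i in range(len(node.children) - 1):
            left, right = node.children[i], node.children[i + 1]
            if start(right) > end(left) + 1:    # gap after child i+1
                moved = node.children[:i + 1]   # hand these to the parent
                node.children = node.children[i + 1:]
                parent.children = sorted(parent.children + moved, key=start)
                node.span = set().union(*(c.span for c in node.children))
                changed = True
                break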
2. Additional preprocessing: In order to obtain
the desired CCG analysis, a certain amount of pre-
processing is required. We insert NPs into PPs,
nouns into NPs (the span of nouns is given by the
NK edge label), and change sentences whose
first element is a complementizer (dass, ob, etc.)
into an SBAR (a category which does not ex-
ist in the original Tiger annotation) with an S argu-
ment. This is necessary to obtain the desired CCG
derivations where complementizers and preposi-
tions take a sentential or nominal argument to their
right, whereas they appear at the same level as
their arguments in the Tiger corpus. Further pre-
processing is required to create the required struc-
tures for wh-extraction and certain coordination
phenomena (see below).
In figure 2, preprocessing of the original Tiger
graph (top) yields the tree shown in the middle
(edge labels are shown as Penn Treebank-style
function tags). (We treat reflexive pronouns as
modifiers.)
We will first present the basic translation algo-
rithm before we explain how we obtain a deriva-
tion which captures the dependency between the
relative pronoun and the embedded verb.
3. The basic translation step Our basic transla-
tion algorithm is very similar to Hockenmaier and
Steedman (2005). It requires a planar tree with-
out crossing edges, where each node is marked as
head, complement or adjunct. The latter informa-
tion is represented in the Tiger edge labels, and
only a small number of additional head rules is re-
quired. Each individual translation step operates
on local trees, which are typically flat.
N → C1 C2 ... Ci ... Cn-1 Cn

Assuming the CCG category of N is X, and its
head position is i, the algorithm traverses first the
left nodes C1 ... Ci-1 from left to right to create a
right-branching derivation tree, and then the right
nodes (Cn ... Ci+1) from right to left to create a
left-branching tree. The algorithm starts at the root
category and recursively traverses the tree. The flat
local tree is thus binarized into a derivation in which
the head child Hi combines first with the constituents
to its right (R, attached low) and then with those to
its left (L, attached higher).
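The following Python sketch shows this traversal (illustrative; category assignment from the Tiger edge labels is abstracted away, and a derivation is represented simply as nested pairs):

def binarize(node):
    # The head child combines first with its right siblings
    # C_i+1 ... C_n (left-branching spine), then the left siblings
    # C_i-1 ... C_1 attach on top (right-branching spine).
    if not node.children:
        return node.word
    kids, h = node.children, node.head_index
    deriv = binarize(kids[h])
    for right in kids[h + 1:]:
        deriv = (deriv, binarize(right))
    for left in reversed(kids[:h]):
        deriv = (binarize(left), deriv)
    return deriv

For a flat local tree with children C1 C2 H C4 and head position 2, this yields the constituent structure ('C1', ('C2', ('H', 'C4'))).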
The CCG category of complements and of the
root of the graph is determined from their Tiger
label. VPs are S[.]\NP, where the feature [.] dis-
tinguishes bare infinitives, zu-infinitives, passives,
and (active) past participles. With the exception
of passives, these features can be determined from
the POS tags alone. (The eventive (“werden”) passive
is easily identified by context; however, we found
that not all stative (“sein”) passives seem to be
annotated as such.) Embedded sentences (under
an SBAR node) are always S[vlast]. NPs and nouns
(NP and N) have a case feature, e.g. [nom]. (In
some contexts, measure nouns (e.g. Mark, Kilometer)
lack case annotation.) Like
the English CCGbank, our grammar ignores num-
ber and person agreement.
Special cases: Wh-extraction and extraposition
In Tiger, wh-extraction is not explicitly marked.
Relative clauses, wh-questions and free relatives
are all annotated as S-nodes, and the wh-word is
a normal argument of the verb. After turning the
graph into a planar tree, we can identify these
constructions by searching for a relative pronoun
in the leftmost child of an S node (which may
be marked as extraposed in the case of extrac-
tion from an embedded verb). As shown in fig-
ure 2, we turn this S into an SBAR (a category
which does not exist in Tiger) with the first edge
as complementizer and move the remaining chil-
dren under a new S node which becomes the sec-
ond daughter of the SBAR. The relative pronoun
is the head of this SBAR and takes the S-node as
argument. Its category is S[vlast], since all clauses
with a complementizer are verb-final. In order to
capture the long-range dependency, a “trace” is
introduced, and percolated down the tree, much
like in the algorithm of Hockenmaier and Steed-
man (2005), and similar to GPSG’s slash-passing
(Gazdar et al., 1985). These trace categories are
appended to the category of the head node (and
other arguments are type-raised as necessary). In
our case, the trace is also associated with the verb
whose argument it is. If the span of this verb
is within the span of a complement, the trace is
percolated down this complement. When the VP
that is headed by this verb is reached, we assume
a canonical order of arguments in order to “dis-
charge” the trace.
If a complement node is marked as extraposed,
it is also percolated down the head tree until the
constituent whose argument it is is found. When
another complement is found whose span includes
the span of the constituent whose argument the ex-
traposed edge is, the extraposed category is perco-
lated down this tree (we assume extraction out of
adjuncts is impossible). (In our current implementation,
each node cannot have more than one forward and one
backward extraposed element and one forward and one
backward trace; it may be preferable to use list
structures instead, especially for extraposition.)
In order to capture the
topicalization analysis, main clause subjects also
introduce a trace. Fronted complements or sub-
jects, and the first adjunct in main clauses are ana-
lyzed as described in figure 1.
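A sketch of the percolation logic (reusing the Functor sketch from Section 3.1; the node attributes here are assumptions, not the original implementation):

def percolate(node, trace):
    # A trace pairs the extracted category with the verb whose argument
    # it is. When the projection of that verb is reached, the trace is
    # discharged as an argument slot in canonical order, e.g. turning
    # S[vlast] into S[vlast]\NP[dat] for the relative clause in Figure 2.
    if node.head_verb is trace.verb:
        node.cat = Functor(node.cat, '\\', trace.cat)
        return
    for child in node.children:
        if child.is_complement and trace.verb in child.leaves():
            percolate(child, trace)    # the verb lies inside this complement
            return
    percolate(node.head_child, trace)  # otherwise follow the head path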
Special case: coordination – secondary edges
Tiger uses “secondary edges” to represent the de-
pendencies that arise in coordinate constructions
such as gapping, argument cluster coordination
and right (or left) node raising (Figure 3). In right
(left) node raising, the shared elements are argu-
ments or adjuncts that appear on the right periph-
ery of the last, (or left periphery of the first) con-
junct. CCG uses type-raising and composition to
combine the incomplete conjuncts into one con-
stituent which combines with the shared element:
liest immer und beantwortet gerne jeden Brief.
always reads and gladly replies to every letter.

Both conjuncts, liest immer and beantwortet gerne, end up with the category (S[v1]/NP)/NP; coordination yields (S[v1]/NP)/NP, which applies (>) to the shared object jeden Brief := NP to give S[v1]/NP.
[Complex coordinations: a Tiger graph with secondary edges. The example is während 78 Prozent sich für Bush aussprachen und vier Prozent für Clinton ('while 78 percent argued for Bush and four percent for Clinton'); secondary edges indicate that the second conjunct shares the head aussprachen (and sich) with the first.
The planar tree after preprocessing:
(SBAR (KOUS-HD während) (S-ARG (ARGCLUSTER (S-CJ (NP-SB 78 Prozent) (PRF-MO sich) (PP-MO für Bush)) (KON-CD und) (S-CJ (NP-SB vier Prozent) (PP-MO für Clinton))) (VVFIN-HD aussprachen)))
The resulting CCG derivation: in each conjunct the type-raised subject (S[vlast]/(S[vlast]\NP[nom])) composes with the modifiers ((S\NP)/(S\NP)) to yield S[vlast]/(S[vlast]\NP[nom]); the two conjuncts are coordinated, and the resulting cluster applies to aussprachen := S[vlast]\NP[nom] to give S[vlast], which während takes as argument to form a sentential modifier S\S.]
Figure 3: Processing secondary edges in Tiger
In order to obtain this analysis, we lift such
shared peripheral constituents inside the conjuncts
of conjoined sentences CS (or verb phrases, CVP)
to a new S (VP) level that we insert in between the
CS and its parent.
In argument cluster coordination (Figure 3), the
shared peripheral element (aussprachen) is the
head. (Während has scope over the entire
coordinated structure.)
In CCG, the remaining arguments and ad-
juncts combine via composition and type-raising
into a functor category which takes the category of
the head as argument (e.g. a ditransitive verb), and
returns the same category that would result from
a non-coordinated structure (e.g. a VP). The re-
sult category of the furthest element in each con-
junct is equal to the category of the entire VP (or
sentence), and all other elements are type-raised
and composed with this to yield a category which
takes as argument a verb with the required subcat
frame and returns a verb phrase (sentence). Tiger
assumes instead that there are two conjuncts (one
of which is headless), and uses secondary edges
to indicate the dependencies between the head and
the elements in the distant conjunct. Coordinated
sentences and VPs (CS and CVP) that have this
annotation are rebracketed to obtain the CCG con-
stituent structure, and the conjuncts are marked as
argument clusters. Since the edges in the argu-
ment cluster are labeled with their correct syntac-
tic functions, we are able to mimic the derivation
during category assignment.
In sentential gapping, the main verb is shared
and appears in the middle of the first conjunct:
(6) Er trinkt Bier und sie Wein.
He drinks beer and she wine.
As in the English CCGbank, we ignore this con-
struction, which requires a non-combinatory “de-
composition” rule (Steedman, 1990).
5 Evaluation
Translation coverage The algorithm can fail at
several stages. If the graph cannot be turned into a
tree, it cannot be translated. This happens in 1.3%
(647) of all sentences. In many cases, this is due
to coordinated NPs or PPs where one or more con-
juncts are extraposed. We believe that these are
anaphoric, and further preprocessing could take
care of this. In other cases, this is due to verb top-
icalization (gegeben hat Peter Maria das Buch),
which our algorithm cannot currently deal with
(the corresponding CCG derivation combines the
remnant complements as in argument cluster
coordination).
For 1.9% of the sentences, the algorithm cannot
obtain a correct CCG derivation. Mostly this is
the case because some traces and extraposed el-
ements cannot be discharged properly. Typically
this happens either in local scrambling, where an
object of the main verb appears between the aux-
iliary and the subject (hat das Buch Peter ...), or
when an argument of a noun that appears in a rel-
ative clause is extraposed to the right. (The former
problem arises because Tiger annotates subjects as
arguments of the auxiliary; we believe it could be
avoided if they were instead arguments of the
non-finite verb.) There are
also a small number of constituents whose head is
not annotated. We ignore any gapping construc-
tion or argument cluster coordination that we can-
not get into the right shape (1.5%, 732 sentences).
There are also a number of other constructions
that we do not currently deal with. We do not pro-
cess sentences if the root of the graph is a “virtual
root” that does not expand into a sentence (1.7%,
869). This is mostly the case for strings such as
Frankfurt (Reuters), or if we cannot identify a
head child of the root node (1.3%, 648; mostly
fragments or elliptical constructions).
Overall, we obtain CCG derivations for 92.4%
(46,628) of all 50,474 sentences, including
88.4% (12,122) of those whose Tiger graphs are
marked as discontinuous (13,717), and 95.2%
of all 48,957 full sentences (excluding headless
roots, and fragments, but counting coordinate
structures such as gapping).
Lexicon size There are 2,506 lexical category
types, but 1,018 of these appear only once. 933
category types appear more than 5 times.
Lexical coverage In order to evaluate coverage
of the extracted lexicon on unseen data, we split
the corpus into segments of 5,000 sentences (ig-
noring the last 474), and perform 10-fold cross-
validation, using 9 segments to extract a lexicon
and the 10th to test its coverage. Average cover-
age is 86.7% (by token) of all lexical categories.
Coverage varies between 84.4% and 87.6%. On
average, 92% (90.3%-92.6%) of the lexical tokens
that appear in the held-out data also appear in the
training data. On these seen tokens, coverage is
94.2% (93.5%–94.6%). More than half of all miss-
ing lexical entries are nouns.
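In code, this evaluation amounts to the following sketch (assuming the corpus is represented as ten folds of sentences, each sentence a list of (word, category) pairs; names are illustrative):

def coverage(folds):
    results = []
    for i in range(len(folds)):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        test = [pair for s in folds[i] for pair in s]
        lexicon = {}
        for sent in train:
            for word, cat in sent:
                lexicon.setdefault(word, set()).add(cat)
        seen = [(w, c) for w, c in test if w in lexicon]
        correct = [(w, c) for w, c in seen if c in lexicon[w]]
        results.append((len(correct) / len(test),   # coverage, all tokens
                        len(seen) / len(test),      # proportion of seen tokens
                        len(correct) / len(seen)))  # coverage, seen tokens
    return results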
In the English CCGbank, a lexicon extracted
from section 02-21 (930,000 tokens) has 94% cov-
erage on all tokens in section 00, and 97.7% cov-
erage on all seen tokens (Hockenmaier and Steed-
man, 2005). In the English data set, the proportion
of seen tokens (96.2%) is much higher, most likely
because of the relative lack of derivational and in-
flectional morphology. The better lexical coverage
on seen tokens is also to be expected, given that the
flexible word order of German requires case mark-
ings on all nouns as well as at least two different
categories for each tensed verb, and more in order
to account for local scrambling.
6 Conclusion and future work
We have presented an algorithm which converts
the syntax graphs in the German Tiger corpus
(Brants et al., 2002) into Combinatory Catego-
rial Grammar derivation trees. This algorithm is
currently able to translate 92.4% of all graphs in
Tiger, or 95.2% of all full sentences. Lexicons
extracted from this corpus contain the correct en-
tries for 86.7% of all and 94.2% of all seen to-
kens. Good lexical coverage is essential for the
performance of statistical CCG parsers (Hocken-
maier and Steedman, 2002a). Since the Tiger cor-
pus contains complete morphological and lemma
information for all words, future work will address
the question of how to identify and apply a set of
(non-recursive) lexical rules (Carpenter, 1992) to
the extracted CCG lexicon to create a much larger
lexicon. The number of lexical category types is
almost twice as large as that of the English CCG-
bank. This is to be expected, since our gram-
mar includes case features, and German verbs re-
quire different categories for main and subordinate
clauses. We currently perform only the most es-
sential preprocessing steps, although there are a
number of constructions that might benefit from
additional changes (e.g. comparatives, parentheti-
cals, or fragments), both to increase coverage and
accuracy of the extracted grammar.
Since the Tiger corpus is of comparable size to the
Penn Treebank, we hope that the work presented
here will stimulate research into statistical wide-
coverage parsing of free word order languages
such as German with deep grammars like CCG.
Acknowledgments
I would like to thank Mark Steedman and Aravind
Joshi for many helpful discussions. This research
is supported by NSF ITR grant 0205456.
References
Jason Baldridge. 2002. Lexically Specified Derivational
Control in Combinatory Categorial Grammar. Ph.D. the-
sis, School of Informatics, University of Edinburgh.
Alena Böhmová, Jan Hajič, Eva Hajičová, and Barbora
Hladká. 2003. The Prague Dependency Treebank: Three-
level annotation scenario. In Anne Abeillé, editor, Tree-
banks: Building and Using Syntactically Annotated Cor-
pora. Kluwer.
Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang
Lezius, and George Smith. 2002. The TIGER tree-
bank. In Workshop on Treebanks and Linguistic Theories,
Sozopol.
Aoife Cahill, Martin Forst, Mairead McCarthy, Ruth
O’Donovan, Christian Rohrer, Josef van Genabith, and
Andy Way. 2005. Treebank-based acquisition of multilin-
gual unification-grammar resources. Journal of Research
on Language and Computation.
Ruken Çakıcı. 2005. Automatic induction of a CCG gram-
mar for Turkish. In ACL Student Research Workshop,
pages 73–78, Ann Arbor, MI, June.
Bob Carpenter. 1992. Categorial grammars, lexical rules,
and the English predicative. In Robert Levine, editor, For-
mal Grammar: Theory and Implementation, chapter 3.
Oxford University Press.
John Chen, Srinivas Bangalore, and K. Vijay-Shanker. 2005.
Automated extraction of Tree-Adjoining Grammars from
treebanks. Natural Language Engineering.
Stephen Clark and James R. Curran. 2004. Parsing the
WSJ using CCG and log-linear models. In Proceedings
of the 42nd Annual Meeting of the Association for Com-
putational Linguistics, Barcelona, Spain.
Amit Dubey and Frank Keller. 2003. Probabilistic parsing
for German using Sister-Head dependencies. In Erhard
Hinrichs and Dan Roth, editors, Proceedings of the 41st
Annual Meeting of the Association for Computational Lin-
guistics, pages 96–103, Sapporo, Japan.
Gerald Gazdar, Ewan Klein, Geoffrey K. Pullum, and Ivan A.
Sag. 1985. Generalised Phrase Structure Grammar.
Blackwell, Oxford.
Julia Hockenmaier and Mark Steedman. 2002a. Acquir-
ing compact lexicalized grammars from a cleaner Tree-
bank. In Proceedings of the Third International Con-
ference on Language Resources and Evaluation (LREC),
pages 1974–1981, Las Palmas, Spain, May.
Julia Hockenmaier and Mark Steedman. 2002b. Generative
models for statistical parsing with Combinatory Categorial
Grammar. In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics, pages 335–
342, Philadelphia, PA.
Julia Hockenmaier and Mark Steedman. 2005. CCGbank:
Users’ manual. Technical Report MS-CIS-05-09, Com-
puter and Information Science, University of Pennsylva-
nia.
Beryl Hoffman. 1995. Computational Analysis of the Syntax
and Interpretation of ‘Free’ Word-order in Turkish. Ph.D.
thesis, University of Pennsylvania. IRCS Report 95-17.
Roger Levy and Christopher Manning. 2004. Deep depen-
dencies from context-free statistical parsers: correcting
the surface dependency approximation. In Proceedings
of the 42nd Annual Meeting of the Association for Com-
putational Linguistics.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large annotated corpus
of English: the Penn Treebank. Computational Linguis-
tics, 19:313–330.
Yusuke Miyao and Jun’ichi Tsujii. 2005. Probabilistic dis-
ambiguation models for wide-coverage HPSG parsing. In
Proceedings of the 43rd Annual Meeting of the Associa-
tion for Computational Linguistics, pages 83–90, Ann Ar-
bor, MI.
Yusuke Miyao, Takashi Ninomiya, and Jun’ichi Tsujii. 2004.
Corpus-oriented grammar development for acquiring a
Head-driven Phrase Structure Grammar from the Penn
Treebank. In Proceedings of the First International Joint
Conference on Natural Language Processing (IJCNLP-
04).
Michael Moortgat and Richard Moot. 2002. Using the Spo-
ken Dutch Corpus for type-logical grammar induction.
In Proceedings of the Third International Conference on
Language Resources and Evaluation (LREC).
Ruth O’Donovan, Michael Burke, Aoife Cahill, Josef van
Genabith, and Andy Way. 2005. Large-scale induc-
tion and evaluation of lexical resources from the Penn-
II and Penn-III Treebanks. Computational Linguistics,
31(3):329–365, September.
Owen Rambow. 1994. Formal and Computational Aspects
of Natural Language Syntax. Ph.D. thesis, University of
Pennsylvania, Philadelphia PA.
Libin Shen and Aravind K. Joshi. 2005. Incremental LTAG
parsing. In Proceedings of the Human Language Tech-
nology Conference / Conference of Empirical Methods in
Natural Language Processing (HLT/EMNLP).
Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans
Uszkoreit. 1997. An annotation scheme for free word
order languages. In Fifth Conference on Applied Natural
Language Processing.
Mark Steedman. 1990. Gapping as constituent coordination.
Linguistics and Philosophy, 13:207–263.
Mark Steedman. 1996. Surface Structure and Interpretation.
MIT Press, Cambridge, MA. Linguistic Inquiry Mono-
graph, 30.
Mark Steedman. 2000. The Syntactic Process. MIT Press,
Cambridge, MA.
Fei Xia. 1999. Extracting Tree Adjoining Grammars from
bracketed corpora. In Proceedings of the 5th Natural Lan-
guage Processing Pacific Rim Symposium (NLPRS-99).
