Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 144–151,
New York, June 2006. c©2006 Association for Computational Linguistics
Partial Training for a Lexicalized-Grammar Parser
Stephen Clark
Oxford University Computing Laboratory
Wolfson Building, Parks Road
Oxford, OX1 3QD, UK
stephen.clark@comlab.ox.ac.uk
James R. Curran
School of Information Technologies
University of Sydney
NSW 2006, Australia
james@it.usyd.edu.au
Abstract
We propose a solution to the annotation
bottleneck for statistical parsing, by ex-
ploiting the lexicalized nature of Combi-
natory Categorial Grammar (CCG). The
parsing model uses predicate-argument
dependencies for training, which are de-
rived from sequences of CCG lexical cate-
gories rather than full derivations. A sim-
ple method is used for extracting depen-
dencies from lexical category sequences,
resulting in high precision, yet incomplete
and noisy data. The dependency parsing
model of Clark and Curran (2004b) is ex-
tended to exploit this partial training data.
Remarkably, the accuracy of the parser
trained on data derived from category se-
quences alone is only 1.3% worse in terms
of F-score than the parser trained on com-
plete dependency structures.
1 Introduction
State-of-the-art statistical parsers require large
amounts of hand-annotated training data, and are
typically based on the Penn Treebank, the largest
treebank available for English. Even robust parsers
using linguistically sophisticated formalisms, such
as TAG (Chiang, 2000), CCG (Clark and Curran,
2004b; Hockenmaier, 2003), HPSG (Miyao et al.,
2004) and LFG (Riezler et al., 2002; Cahill et al.,
2004), often use training data derived from the Penn
Treebank. The labour-intensive nature of the tree-
bank development process, which can take many
years, creates a significant barrier for the develop-
ment of parsers for new domains and languages.
Previous work has attempted parser adaptation
without relying on treebank data from the new do-
main (Steedman et al., 2003; Lease and Charniak,
2005). In this paper we propose the use of anno-
tated data in the new domain, but only partially an-
notated data, which reduces the annotation effort re-
quired (Hwa, 1999). We develop a parsing model
whichcanbetrainedusingpartialdata,byexploiting
the properties of lexicalized grammar formalisms.
The formalism we use is Combinatory Categorial
Grammar (Steedman, 2000), together with a parsing
model described in Clark and Curran (2004b) which
we adapt for use with partial data.
Parsing with Combinatory Categorial Grammar
(CCG) takes place in two stages: first, CCG lexical
categories are assigned to the words in the sentence,
and then the categories are combined by the parser
(Clark and Curran, 2004a). The lexical categories
can be thought of as detailed part of speech tags and
typically express subcategorization information. We
exploit the fact that CCG lexical categories contain
a lot of syntactic information, and can therefore be
used for training a full parser, even though attach-
ment information is not explicitly represented in a
category sequence. Our partial training regime only
requires sentences to be annotated with lexical cate-
gories, rather than full parse trees; therefore the data
can be produced much more quickly for a new do-
main or language (Clark et al., 2004).
The partial training method uses the log-linear
dependency model described in Clark and Curran
(2004b), which uses sets of predicate-argument de-
144
pendencies, ratherthanderivations, fortraining. Our
novel idea is that, since there is so much informa-
tion in the lexical category sequence, most of the
correct dependencies can be easily inferred from the
categories alone. More specifically, for a given sen-
tence and lexical category sequence, we train on
those predicate-argument dependencies which occur
in k% of the derivations licenced by the lexical cat-
egories. By setting the k parameter high, we can
produce a set of high precision dependencies for
training. A similar idea is proposed by Carroll and
Briscoe (2002) for producing high precision data for
lexical acquisition.
Using this procedure we are able to produce de-
pendency data with over 99% precision and, re-
markably, up to 86% recall, when compared against
the complete gold-standard dependency data. The
high recall figure results from the significant amount
of syntactic information in the lexical categories,
which reduces the ambiguity in the possible depen-
dency structures. Since the recall is not 100%, we
require a log-linear training method which works
with partial data. Riezler et al. (2002) describe a
partial training method for a log-linear LFG parsing
model in which the “correct” LFG derivations for a
sentence are those consistent with the less detailed
gold standard derivation from the Penn Treebank.
We use a similar method here by treating a CCG
derivation as correct if it is consistent with the high-
precisionpartialdependencystructure. Section3ex-
plains what we mean by consistency in this context.
Surprisingly, the accuracy of the parser trained on
partial data approaches that of the parser trained on
full data: our best partial-data model is only 1.3%
worse in terms of dependency F-score than the full-
data model, despite the fact that the partial data does
not contain any explicit attachment information.
2 The CCG Parsing Model
Clark and Curran (2004b) describes two log-linear
parsing models for CCG: a normal-form derivation
model and a dependency model. In this paper we
use the dependency model, which requires sets of
predicate-argument dependencies for training.1
1Hockenmaier and Steedman (2002) describe a generative
model of normal-form derivations; one possibility for training
this model on partial data, which has not been explored, is to
use the EM algorithm (Pereira and Schabes, 1992).
The predicate-argument dependencies are repre-
sented as 5-tuples: 〈hf,f,s,ha,l〉, where hf is the
lexical item of the lexical category expressing the
dependency relation; f is the lexical category; s is
the argument slot; ha is the head word of the ar-
gument; and l encodes whether the dependency is
non-local. For example, the dependency encoding
company as the object of bought (as in IBM bought
the company) is represented as follows:
〈bought2, (S\NP1)/NP2, 2, company4,−〉 (1)
CCG dependency structures are sets of predicate-
argument dependencies. We define the probability
ofadependencystructureasthesumoftheprobabil-
ities of all those derivations leading to that structure
(Clark and Curran, 2004b). “Spurious ambiguity” in
CCG means that there can be more than one deriva-
tion leading to any one dependency structure. Thus,
the probability of a dependency structure, pi, given a
sentence, S, is defined as follows:
P(pi|S) = summationdisplay
d∈∆(pi)
P(d,pi|S) (2)
where ∆(pi) is the set of derivations which lead to pi.
The probability of a 〈d,pi〉 pair, ω, conditional on
a sentence S, is defined using a log-linear form:
P(ω|S) = 1Z
S
eλ.f(ω) (3)
where λ.f(ω) =summationtexti λifi(ω). The function fi is the
integer-valued frequency function of the ith feature;
λi is the weight of the ith feature; and ZS is a nor-
malising constant.
Clark and Curran (2004b) describes the training
procedure for the dependency model, which uses a
discriminativeestimationmethodbymaximisingthe
conditional likelihood of the model given the data
(Riezler et al., 2002). The optimisation of the objec-
tive function is performed using the limited-memory
BFGS numerical optimisation algorithm (Nocedal
and Wright, 1999; Malouf, 2002), which requires
calculation of the objective function and the gradi-
ent of the objective function at each iteration.
The objective function is defined below, where
L(Λ) is the likelihood and G(Λ) is a Gaussian prior
term for smoothing.
145
He anticipates growth for the auto maker
NP (S[dcl]\NP)/NP NP (NP\NP)/NP NP[nb]/N N/N N
Figure 1: Example sentence with CCG lexical categories
Lprime(Λ) = L(Λ) −G(Λ) (4)
=
msummationdisplay
j=1
log summationdisplay
d∈∆(pij)
eλ.f(d,pij)
−
msummationdisplay
j=1
log summationdisplay
ω∈ρ(Sj)
eλ.f(ω) −
nsummationdisplay
i=1
λ2i
2σ2
S1,...,Sm are the sentences in the training data;
pi1,...,pim are the corresponding gold-standard de-
pendency structures; ρ(S) is the set of possible
〈derivation, dependency-structure〉 pairs for S; σ is
a smoothing parameter; and n is the number of fea-
tures. The components of the gradient vector are:
∂Lprime(Λ)
∂λi =
msummationdisplay
j=1
summationdisplay
d∈∆(pij)
eλ.f(d,pij)fi(d,pij)
summationtext
d∈∆(pij) eλ.f(d,pij)
(5)
−
msummationdisplay
j=1
summationdisplay
ω∈ρ(Sj)
eλ.f(ω)fi(ω)
summationtext
ω∈ρ(Sj) eλ.f(ω)
− λiσ2
The first two terms of the gradient are expecta-
tions of feature fi: the first expectation is over
all derivations leading to each gold-standard depen-
dency structure, and the second is over all deriva-
tions for each sentence in the training data. The es-
timation process attempts to make the expectations
in (5) equal (ignoring the Gaussian prior term). An-
other way to think of the estimation process is that
it attempts to put as much mass as possible on the
derivations leading to the gold-standard structures
(Riezler et al., 2002).
Calculation of the feature expectations requires
summing over all derivations for a sentence, and
summing over all derivations leading to a gold-
standard dependency structure. Clark and Cur-
ran (2003) shows how the sum over the complete
derivation space can be performed efficiently using
a packed chart and the inside-outside algorithm, and
Clark and Curran (2004b) extends this method to
sum over all derivations leading to a gold-standard
dependency structure.
3 Partial Training
The partial data we use for training the dependency
model is derived from CCG lexical category se-
quences only. Figure 1 gives an example sentence
adapted from CCGbank (Hockenmaier, 2003) to-
gether with its lexical category sequence. Note that,
although the attachment of the prepositional phrase
to the noun phrase is not explicitly represented, it
can be inferred in this example because the lexical
category assigned to the preposition has to combine
with a noun phrase to the left, and in this example
there is only one possibility. One of the key insights
in this paper is that the significant amount of syntac-
tic information in CCG lexical categories allows us
to infer attachment information in many cases.
Theprocedureweuseforextractingdependencies
from a sequence of lexical categories is to return all
thosedependencieswhichoccurink%ofthederiva-
tions licenced by the categories. By giving the k pa-
rameter a high value, we can extract sets of depen-
dencies with very high precision; in fact, assuming
that the correct lexical category sequence licences
the correct derivation, setting k to 100 must result in
100% precision, sinceanydependency whichoccurs
in every derivation must occur in the correct deriva-
tion. Of course the recall is not guaranteed to be
high; decreasingk has the effect of increasing recall,
but at the cost of decreasing precision.
The training method described in Section 2 can
be adapted to use the (potentially incomplete) sets
of dependencies returned by our extraction proce-
dure. In Section 2 a derivation was considered cor-
rect if it produced the complete set of gold-standard
dependencies. In our partial-data version a deriva-
tion is considered correct if it produces dependen-
cies which are consistent with the dependencies re-
turned by our extraction procedure. We define con-
sistency as follows: a set of dependencies D is con-
sistent with a set G if G is a subset of D. We also
say that a derivation d is consistent with dependency
set G if G is a subset of the dependencies produced
by d.
146
This definition of “correct derivation” will intro-
duce some noise into the training data. Noise arises
from sentences where the recall of the extracted de-
pendencies is less than 100%, since some of the
derivations which are consistent with the extracted
dependencies for such sentences will be incorrect.
Noise also arises from sentences where the preci-
sionoftheextracteddependenciesislessthan100%,
since for these sentences every derivation which is
consistent with the extracted dependencies will be
incorrect. The hope is that, if an incorrect derivation
produces mostly correct dependencies, then it can
still be useful for training. Section 4 shows how the
precision and recall of the extracted dependencies
varies with k and how this affects parsing accuracy.
The definitions of the objective function (4) and
the gradient (5) for training remain the same in the
partial-data case; the only differences are that ∆(pi)
isnowdefinedtobethosederivationswhicharecon-
sistent with the partial dependency structure pi, and
the gold-standard dependency structures pij are the
partial structures extracted from the gold-standard
lexical category sequences.2
Clark and Curran (2004b) gives an algorithm for
finding all derivations in a packed chart which pro-
duce a particular set of dependencies. This algo-
rithm is required for calculating the value of the ob-
jective function (4) and the first feature expectation
in (5). We adapt this algorithm for finding all deriva-
tions which are consistent with a partial dependency
structure. The new algorithm is shown in Figure 2.
The algorithm relies on the definition of a packed
chart, which is an instance of a feature forest (Miyao
and Tsujii, 2002). The idea behind a packed chart is
that equivalent chart entries of the same type and in
the same cell are grouped together, and back point-
ers to the daughters indicate how an individual entry
was created. Equivalent entries form the same struc-
tures in any subsequent parsing.
A feature forest is defined in terms of disjunctive
and conjunctive nodes. For a packed chart, the indi-
vidual entries in acell are conjunctive nodes, and the
equivalence classes of entries are disjunctive nodes.
The definition of a feature forest is as follows:
A feature forest Φ is a tuple 〈C,D,R,γ,δ〉 where:
2Note that the procedure does return all the gold-standard
dependencies for some sentences.
〈C,D,R,γ,δ〉 is a packed chart / feature forest
G is a set of dependencies returned by the extraction procedure
Let c be a conjunctive node
Let d be a disjunctive node
deps(c) is the set of dependencies on node c
cdeps(c) = |deps(c) ∩ G|
dmax(c) =summationtextd∈δ(c) dmax(d)+ cdeps(c)
dmax(d) = max{dmax(c) | c ∈ γ(d)}
mark(d):
mark d as a correct node
foreach c ∈ γ(d)
if dmax(c) == dmax(d)
mark c as a correct node
foreach dprime ∈ δ(c)
mark(dprime)
foreach dr ∈ R such that dmax. (dr) = |G|
mark(dr)
Figure 2: Finding nodes in derivations consistent
with a partial dependency structure
• C is a set of conjunctive nodes;
• D is a set of disjunctive nodes;
• R ⊆ D is a set of root disjunctive nodes;
• γ : D → 2C isaconjunctivedaughterfunction;
• δ : C → 2D is a disjunctive daughter function.
Dependencies are associated with conjunctive
nodes in the feature forest. For example, if the
disjunctive nodes (equivalence classes of individual
entries) representing the categories NP and S\NP
combine to produce a conjunctive node S, the re-
sulting S node will have a verb-subject dependency
associated with it.
In Figure 2, cdeps(c) is the number of dependen-
cies on conjunctive node c which appear in partial
structure G; dmax(c) is the maximum number of
dependencies in G produced by any sub-derivation
headed by c; dmax(d) is the same value for disjunc-
tive node d. Recursive definitions for calculating
these values are given; the base case occurs when
conjunctive nodes have no disjunctive daughters.
Thealgorithmidentifiesallthoserootnodeshead-
ing derivations which are consistent with the partial
dependency structure G, and traverses the chart top-
down marking the nodes in those derivations. The
insight behind the algorithm is that, for two con-
junctive nodes in the same equivalence class, if one
node heads a sub-derivation producing more depen-
dencies in G than the other node, then the node with
147
less dependencies inGcannot be part of a derivation
consistent with G.
The conjunctive and disjunctive nodes appearing
in derivations consistent with G form a new “gold-
standard” feature forest. The gold-standard forest,
and the complete forest containing all derivations
spanning the sentence, can be used to estimate the
likelihood value and feature expectations required
by the estimation algorithm. Let EΦΛfi be the ex-
pected value of fi over the forest Φ for model Λ;
then the values in (5) can be obtained by calculating
EΦjΛ fi for the complete forest Φj for each sentence
Sj in the training data (the second sum in (5)), and
also EΨjΛ fi for each forest Ψj of derivations consis-
tentwiththepartialgold-standarddependencystruc-
ture for sentence Sj (the first sum in (5)):
∂L(Λ)
∂λi =
msummationdisplay
j=1
(EΨjΛ fi −EΦjΛ fi) (6)
The likelihood in (4) can be calculated as follows:
L(Λ) =
msummationdisplay
j=1
(logZΨj − logZΦj) (7)
where logZΦ is the normalisation constant for Φ.
4 Experiments
The resource used for the experiments is CCGbank
(Hockenmaier, 2003), which consists of normal-
form CCG derivations derived from the phrase-
structure trees in the Penn Treebank. It also contains
predicate-argument dependencies which we use for
development and final evaluation.
4.1 Accuracy of Dependency Extraction
Sections 2-21 of CCGbank were used to investigate
the accuracy of the partial dependency structures re-
turned by the extraction procedure. Full, correct de-
pendency structures for the sentences in 2-21 were
created by running our CCG parser (Clark and Cur-
ran, 2004b) over the gold-standard derivation for
each sentence, outputting the dependencies. This re-
sultedinfulldependencystructuresfor37,283ofthe
sentences in sections 2-21.
Table 1 gives precision and recall values for the
dependencies obtained from the extraction proce-
dure, for the 37,283 sentences for which we have
k Precision Recall SentAcc
0.99999 99.76 74.96 13.84
0.9 99.69 79.37 16.52
0.85 99.65 81.30 18.40
0.8 99.57 82.96 19.51
0.7 99.09 85.87 22.46
0.6 98.00 88.67 26.28
Table 1: Accuracy of the Partial Dependency Data
complete dependency structures. The SentAcc col-
umn gives the percentage of training sentences for
which the partial dependency structures are com-
pletely correct. For a given sentence, the extrac-
tion procedure returns all dependencies occurring in
at least k% of the derivations licenced by the gold-
standard lexical category sequence. The lexical cat-
egory sequences for the sentences in 2-21 can easily
be read off the CCGbank derivations.
The derivations licenced by a lexical category se-
quence were created using the CCG parser described
inClarkandCurran(2004b). Theparserusesasmall
number of combinatory rules to combine the cate-
gories, along with the CKY chart-parsing algorithm
described in Steedman (2000). It also uses some
unary type-changing rules and punctuation rules ob-
tainedfromthederivationsinCCGbank.3 Theparser
builds a packed representation, and counting the
number of derivations in which a dependency occurs
can be performed using a dynamic programming al-
gorithm similar to the inside-outside algorithm.
Table 1 shows that, by varying the value of k, it
is possible to get the recall of the extracted depen-
dencies as high as 85.9%, while still maintaining a
precision value of over 99%.
4.2 Accuracy of the Parser
The training data for the dependency model was cre-
ated by first supertagging the sentences in sections
2-21, using the supertagger described in Clark and
Curran (2004b).4 The average number of categories
3Since our training method is intended to be applicable in
the absence of derivation data, the use of such rules may appear
suspect. However, we argue that the type-changing and punc-
tuation rules could be manually created for a new domain by
examining the lexical category data.
4An improved version of the supertagger was used for this
paper in which the forward-backward algorithm is used to cal-
culate the lexical category probability distributions.
148
assigned to each word is determined by a parameter,
β, in the supertagger. A category is assigned to a
word if the category’s probability is within β of the
highest probability category for that word.
For these experiments, we used a β value of 0.01,
which assigns roughly 1.6 categories to each word,
on average; we also ensured that the correct lexi-
cal category was in the set assigned to each word.
(We did not do this when parsing the test data.) For
some sentences, the packed charts can become very
large. Thesupertaggingapproachweadoptfortrain-
ing differs to that used for testing: if the size of the
chart exceeds some threshold, the value of β is in-
creased, reducing ambiguity, and the sentence is su-
pertagged and parsed again. The threshold which
limits the size of the charts was set at 300000 indi-
vidual entries. Two further values of β were used:
0.05 and 0.1.
Packed charts were created for each sentence and
stored in memory. It is essential that the packed
charts for each sentence contain at least one deriva-
tion leading to the gold-standard dependency struc-
ture. Not all rule instantiations in CCGbank can be
produced by our parser; hence it is not possible to
produce the gold standard for every sentence in Sec-
tions 2-21. For the full-data model we used 34336
sentences (86.7% of the total). For the partial-data
models we were able to use slightly more, since the
partial structures are easier to produce. Here we
used 35,709 sentences (k = 0.85).
Since some of the packed charts are very large,
we used an 18-node Beowulf cluster, together with
a parallel version of the BFGS training algorithm.
The training time and number of iterations to con-
vergencewere172minutesand997iterationsforthe
full-data model, and 151 minutes and 861 iterations
for the partial-data model (k = 0.85). Approximate
memory usage in each case was 17.6 GB of RAM.
The dependency model uses the same set of fea-
tures described in Clark and Curran (2004b): de-
pendency features representing predicate-argument
dependencies (with and without distance measures);
rule instantiation features encoding the combining
categories together with the result category (with
and without a lexical head); lexical category fea-
tures, consisting of word–category pairs at the leaf
nodes; and root category features, consisting of
headword–category pairs at the root nodes. Further
k LP LR F CatAcc
0.99999 85.80 84.51 85.15 93.77
0.9 85.86 84.51 85.18 93.78
0.85 85.89 84.50 85.19 93.71
0.8 85.89 84.45 85.17 93.70
0.7 85.52 84.07 84.79 93.72
0.6 84.99 83.70 84.34 93.65
FullData 87.16 85.84 86.50 93.79
Random 74.63 72.53 73.57 89.31
Table 2: Accuracy of the Parser on Section 00
generalised features for each feature type are formed
by replacing words with their POS tags.
Only features which occur more than once in the
training data are included, except that the cutoff
for the rule features is 10 or more and the count-
ing is performed across all derivations licenced by
the gold-standard lexical category sequences. The
larger cutoff was used since the productivity of the
grammarcanleadtolargenumbersofthesefeatures.
The dependency model has 548590 features. In or-
dertoprovideafaircomparison, thesamefeatureset
was used for the partial-data and full-data models.
The CCG parsing consists of two phases: first the
supertagger assigns the most probable categories to
each word, and then the small number of combina-
tory rules, plus the type-changing and punctuation
rules, are used with the CKY algorithm to build a
packedchart.5 WeusethemethoddescribedinClark
and Curran (2004b) for integrating the supertagger
with the parser: initially a small number of cat-
egories is assigned to each word, and more cate-
gories are requested if the parser cannot find a span-
ning analysis. The “maximum-recall” algorithm de-
scribed in Clark and Curran (2004b) is used to find
the highest scoring dependency structure.
Table2givestheaccuracyoftheparseronSection
00 of CCGbank, evaluated against the predicate-
argument dependencies in CCGbank.6 The table
gives labelled precision, labelled recall and F-score,
and lexical category accuracy. Numbers are given
for the partial-data model with various values of k,
and for the full-data model, which provides an up-
5Gold-standard POS tags from CCGbank were used for all
the experiments in this paper.
6There are some dependency types produced by our parser
which are not in CCGbank; these were ignored for evaluation.
149
LP LR F CatAcc
k = 0.85 86.21 85.01 85.60 93.90
FullData 87.50 86.37 86.93 94.01
Table 3: Accuracy of the Parser on Section 23
k Precision Recall SentAcc
0.99999 99.71 80.16 17.48
0.9999 99.68 82.09 19.13
0.999 99.49 85.18 22.18
0.99 99.00 88.95 27.69
0.95 98.34 91.69 34.95
0.9 97.82 92.84 39.18
Table 4: Accuracy of the Partial Dependency Data
using Inside-Outside Scores
per bound for the partial-data model. We also give a
lower bound which we obtain by randomly travers-
ing a packed chart top-down, giving equal proba-
bility to each conjunctive node in an equivalence
class. The precision and recall figures are over those
sentences for which the parser returned an analysis
(99.27% of Section 00).
The best result is obtained for a k value of 0.85,
which produces partial dependency data with a pre-
cision of 99.7 and a recall of 81.3. Interestingly, the
results show that decreasing k further, which results
in partial data with a higher recall and only a slight
loss in precison, harms the accuracy of the parser.
The Random result also dispels any suspicion that
the partial-model is performing well simply because
of the supertagger; clearly there is still much work
to be done after the supertagging phase.
Table 3 gives the accuracy of the parser on Sec-
tion23, usingthebestperformingpartial-datamodel
on Section 00. The precision and recall figures are
over those sentences for which the parser returned
an analysis (99.63% of Section 23). The results
show that the partial-data model is only 1.3% F-
score short of the upper bound.
4.3 Further Experiments with Inside-Outside
In a final experiment, we attempted to exploit the
high accuracy of the partial-data model by using it
to provide new training data. For each sentence in
Section 2-21, we parsed the gold-standard lexical
category sequences and used the best performing
partial-data model to assign scores to each depen-
dency in the packed chart. The score for a depen-
dency was the sum of the probabilities of all deriva-
tions producing that dependency, which can be cal-
culated using the inside-outside algorithm. (This is
the score used by the maximum-recall parsing algo-
rithm.) Partial dependency structures were then cre-
ated by returning all dependencies whose score was
above some threshold k, as before. Table 4 gives the
accuracy of the data created by this procedure. Note
how these values differ to those reported in Table 1.
We then trained the dependency model on this
partial data using the same method as before. How-
ever, the peformance of the parser on Section 00 us-
ing these new models was below that of the previous
best performing partial-data model for all values of
k. We report this negative result because we had hy-
pothesised that using a probability model to score
the dependencies, rather than simply the number of
derivations in which they occur, would lead to im-
proved performance.
5 Conclusions
Our main result is that it is possible to train a CCG
dependency model from lexical category sequences
alone and still obtain parsing results which are only
1.3% worse in terms of labelled F-score than a
model trained on complete data. This is a notewor-
thy result and demonstrates the significant amount
of information encoded in CCG lexical categories.
The engineering implication is that, since the de-
pendency model can be trained without annotating
recursive structures, and only needs sequence in-
formation at the word level, then it can be ported
rapidly to a new domain (or language) by annotating
new sequence data in that domain.
One possible response to this argument is that,
since the lexical category sequence contains so
much syntactic information, then the task of anno-
tating category sequences must be almost as labour
intensive as annotating full derivations. To test this
hypothesis fully would require suitable annotation
tools and subjects skilled in CCG annotation, which
we do not currently have access to.
However, there is some evidence that annotat-
ing category sequences can be done very efficiently.
Clark et al. (2004) describes a porting experiment
150
in which a CCG parser is adapted for the ques-
tion domain. The supertagger component of the
parser is trained on questions annotated at the lex-
ical category level only. The training data consists
of over 1,000 annotated questions which took less
than a week to create. This suggests, as a very
rough approximation, that 4 annotators could an-
notate 40,000 sentences with lexical categories (the
size of the Penn Treebank) in a few months.
Another advantage of annotating with lexical cat-
egories is that a CCG supertagger can be used to per-
form most of the annotation, with the human an-
notator only required to correct the mistakes made
by the supertagger. An accurate supertagger can be
bootstrapped quicky, leaving only a small number of
corrections for the annotator. A similar procedure is
suggestedbyDoranetal.(1997)forportingan LTAG
grammar to a new domain.
We have a proposed a novel solution to the an-
notation bottleneck for statistical parsing which ex-
ploits the lexicalized nature of CCG, and may there-
fore be applicable to other lexicalized grammar for-
malisms such as LTAG.
References
A. Cahill, M. Burke, R. O’Donovan, J. van Genabith, and
A. Way. 2004. Long-distance dependency resolution in au-
tomatically acquired wide-coverage PCFG-based LFG ap-
proximations. In Proceedings of the 42nd Meeting of the
ACL, pages 320–327, Barcelona, Spain.
John Carroll and Ted Briscoe. 2002. High precision extrac-
tion of grammatical relations. In Proceedings of the 19th In-
ternational Conference on Computational Linguistics, pages
134–140, Taipei, Taiwan.
David Chiang. 2000. Statistical parsing with an automatically-
extracted Tree Adjoining Grammar. In Proceedings of the
38th Meeting of the ACL, pages 456–463, Hong Kong.
Stephen Clark and James R. Curran. 2003. Log-linear mod-
els for wide-coverage CCG parsing. In Proceedings of the
EMNLP Conference, pages 97–104, Sapporo, Japan.
Stephen Clark and James R. Curran. 2004a. The importance of
supertagging for wide-coverage CCG parsing. In Proceed-
ings of COLING-04, pages 282–288, Geneva, Switzerland.
Stephen Clark and James R. Curran. 2004b. Parsing the WSJ
using CCG and log-linear models. In Proceedings of the
42nd Meeting of the ACL, pages 104–111, Barcelona, Spain.
Stephen Clark, Mark Steedman, and James R. Curran. 2004.
Object-extraction and question-parsing using CCG. In
Proceedings of the EMNLP Conference, pages 111–118,
Barcelona, Spain.
C. Doran, B. Hockey, P. Hopely, J. Rosenzweig, A. Sarkar,
B. Srinivas, F. Xia, A. Nasr, and O. Rambow. 1997. Main-
taining the forest and burning out the underbrush in XTAG.
In Proceedings of the ENVGRAM Workshop, Madrid, Spain.
Julia Hockenmaier and Mark Steedman. 2002. Generative
models for statistical parsing with Combinatory Categorial
Grammar. In Proceedings of the 40th Meeting of the ACL,
pages 335–342, Philadelphia, PA.
Julia Hockenmaier. 2003. Data and Models for Statistical
Parsing with Combinatory Categorial Grammar. Ph.D. the-
sis, University of Edinburgh.
Rebbeca Hwa. 1999. Supervised grammar induction using
training data with limited constituent information. In Pro-
ceedings of the 37th Meeting of the ACL, pages 73–79, Uni-
versity of Maryland, MD.
Matthew Lease and Eugene Charniak. 2005. Parsing biomed-
ical literature. In Proceedings of the Second Interna-
tional Joint Conference on Natural Language Processing
(IJCNLP-05), Jeju Island, Korea.
Robert Malouf. 2002. A comparison of algorithms for max-
imum entropy parameter estimation. In Proceedings of the
Sixth Workshop on Natural Language Learning, pages 49–
55, Taipei, Taiwan.
Yusuke Miyao and Jun’ichi Tsujii. 2002. Maximum entropy
estimation for feature forests. In Proceedings of the Human
Language Technology Conference, San Diego, CA.
Yusuke Miyao, Takashi Ninomiya, and Jun’ichi Tsujii. 2004.
Corpus-orientedgrammar development for acquiring ahead-
driven phrase structure grammar from the Penn Treebank. In
Proceedings of the First International Joint Conference on
Natural Language Processing (IJCNLP-04), pages 684–693,
Hainan Island, China.
Jorge Nocedal and Stephen J. Wright. 1999. Numerical Opti-
mization. Springer, New York, USA.
Fernando Pereira and Yves Schabes. 1992. Inside-outside rees-
timation from partially bracketed corpora. In Proceedings of
the 30th Meeting of the ACL, pages 128–135, Newark, DE.
Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard
Crouch, John T. Maxwell III, and Mark Johnson. 2002.
Parsing the Wall Street Journal using a Lexical-Functional
Grammar and discriminative estimation techniques. In Pro-
ceedings of the 40th Meeting of the ACL, pages 271–278,
Philadelphia, PA.
Mark Steedman, Miles Osborne, Anoop Sarkar, Stephen Clark,
RebeccaHwa, JuliaHockenmaier, PaulRuhlen, SteveBaker,
and Jeremiah Crim. 2003. Bootstrapping statistical parsers
from small datasets. In Proceedings of the 11th Conference
of the European Association for Computational Linguistics,
Budapest, Hungary.
Mark Steedman. 2000. The Syntactic Process. The MIT Press,
Cambridge, MA.
151
