High Precision Extraction of Grammatical Relations
John Carroll
Cognitive and Computing Sciences
University of Sussex
Falmer, Brighton
BN1 9QH, UK
Ted Briscoe
Computer Laboratory
University of Cambridge
JJ Thomson Avenue
Cambridge CB3 0FD, UK
Abstract
A parsing system returning analyses in the form of
sets of grammatical relations can obtain high pre-
cision if it hypothesises a particular relation only
when it is certain that the relation is correct. We
operationalise this technique—in a statistical parser
using a manually-developed wide-coverage gram-
mar of English—by only returning relations that
form part of all analyses licensed by the grammar.
We observe an increase in precision from 75% to
over 90% (at the cost of a reduction in recall) on a
test corpus of naturally-occurring text.
1 Introduction [1]
Head-dependent relationships (possibly labelled with a relation type) have been
advocated as a useful level of representation for grammatical structure in a number
of different large-scale language-processing tasks. For instance, in recent work on
statistical treebank grammar parsing (e.g. Collins, 1999) high levels of accuracy
have been reached using lexicalised probabilistic models over head-dependent tuples.
Bouma, van Noord and Malouf (2001) create dependency treebanks semi-automatically in
order to induce dependency-based statistical models for parse selection. Lin (1998),
Srinivas (2000) and others have evaluated the accuracy of both phrase structure-based
and dependency parsers by matching head-dependent relations against ‘gold standard’
relations, rather than matching (labelled) phrase structure bracketings. Research on
unsupervised acquisition of lexical information from corpora, such as argument
structure of predicates (Briscoe and Carroll, 1997; McCarthy, 2000), word classes for
disambiguation (Clark and Weir, 2001), and collocations (Lin, 1999), has used
grammatical relation/head/dependent tuples. Such tuples also constitute a convenient
intermediate representation in applications such as information extraction (Palmer et
al., 1993; Yeh, 2000), and document retrieval on the Web (Grefenstette, 1997).

[1] A previous version of this paper was presented at IWPT’01; this version contains
new experiments and results.

A variety of different approaches have been taken
for robust extraction of relation/head/dependent tu-
ples, or grammatical relations, from unrestricted
text. Dependency parsing is a natural technique
to use, and there has been some work in that area
on robust analysis and disambiguation (e.g. Laf-
ferty, Sleator and Temperley, 1992; Srinivas, 2000).
Finite-state approaches (e.g. Karlsson et al., 1995; Aït-Mokhtar and Chanod, 1997;
Grefenstette, 1998) have used hand-coded transducers to recognise linear
configurations of words and part of speech labels associated with, for example,
subject/object-verb relationships. An intermediate step may be to mark nominal,
verbal etc. ‘chunks’ in the text and to identify the head word of each of the chunks.
Statistical finite-state approaches have also been used: Brants, Skut and Krenn
(1997) train a cascade of Hidden Markov Models to tag words with their grammatical
functions. Approaches based on memory-based learning have also used chunking as a
first stage, before assigning grammatical relation labels to heads of chunks
(Argamon, Dagan and Krymolowski, 1998; Buchholz, Veenstra and Daelemans, 1999).
Blaheta and Charniak (2000) assume a richer input representation consisting of
labelled trees produced by a treebank grammar parser, and use the treebank again to
train a further procedure that assigns grammatical function tags to syntactic
constituents in the trees. Alternatively, a hand-written grammar can be used that
produces ‘shallow’ and perhaps partial phrase structure analyses from which
grammatical relations are extracted (e.g. Carroll, Minnen and Briscoe, 1998; Lin,
1998).
Recently, Schmid and Rooth (2001) have de-
scribed an algorithm for computing expected gov-
ernor labels for terminal words in labelled headed
parse trees produced by a probabilistic context-free
grammar. A governor label (implicitly) encodes a
grammatical relation type (such as subject or ob-
ject) and a governing lexical head. The labels are
expected in the sense that each is weighted by the
sum of the probabilities of the trees giving rise to
it, and are computed efficiently by processing the
entire parse forest rather than individual trees. The
set of terminal/relation/governing-head tuples will
not typically constitute a globally coherent analy-
sis, but may be useful for interfacing to applications
that primarily accumulate fragments of grammati-
cal information from text (such as for instance in-
formation extraction, or systems that acquire lexical
data from corpora). The approach is not so suit-
able for applications that need to interpret complete
and consistent sentence structures (such as the anal-
ysis phase of transfer-based machine translation).
Schmid and Rooth have implemented the algorithm
for parsing with a lexicalised probabilistic context-
free grammar of English and applied it in an open
domain question answering system, but they do not
give any practical results or an evaluation.
In this paper we investigate Schmid and Rooth’s proposals empirically, using a
wide-coverage parsing system applied to a test corpus of naturally-occurring text. We
extend them with various thresholding techniques, and observe the trade-off between
precision and recall in the grammatical relations returned. Using the most
conservative threshold results in a parser that returns only grammatical relations
that form part of all analyses licensed by the grammar. In this case, precision rises
to over 90%, as compared with a baseline of 75%.
2 The Analysis System
In this investigation we extend a statistical shallow parsing system for English
developed originally by Carroll, Minnen and Briscoe (1998). Briefly, the system works
as follows: input text is labelled with part-of-speech (PoS) tags by a tagger, and
these are parsed using a wide-coverage unification-based ‘phrasal’ grammar of English
PoS tags and punctuation. For disambiguation, the parser uses a probabilistic LR
model derived from parse tree structures in a treebank, augmented with a set of
lexical entries for verbs, acquired automatically from a 10 million word sample of
the British National Corpus (Leech, 1992), each entry containing subcategorisation
frame information and an associated probability. The parser is therefore
‘semi-lexicalised’ in that verbal argument structure is disambiguated lexically, but
the rest of the disambiguation is purely structural.
The coverage of the grammar—the proportion of
sentences for which at least one complete spanning
analysis is found—is around 80% when applied to
the SUSANNE corpus (Sampson, 1995). In addition,
the system is able to perform parse failure recov-
ery, finding the highest scoring sequence of phrasal
fragments (following the approach of Kiefer et al.,
1999), and the system has produced at least partial
analyses for over 98% of the sentences in the written
part of the British National Corpus.
The parsing system reads off grammatical relation tuples (GRs) from the constituent
structure tree that is returned from the disambiguation phase. Information is used
about which grammar rules introduce subjects, complements, and modifiers, and which
daughter(s) is/are the head(s), and which the dependents. In Carroll et al.’s
evaluation the system achieves GR accuracy that is comparable to published results
for other systems: extraction of non-clausal subject relations with 83% precision,
compared with Grefenstette’s (1998) figure of 80%; and overall F-score [2] of
unlabelled head-dependent pairs of 80%, as opposed to Lin’s (1998) 83% [3] and
Srinivas’s (2000) 84% (this with respect only to binary relations, and omitting the
analysis of control relationships). Blaheta and Charniak (2000) report an F-score of
87% for assigning grammatical function tags to constituents, but the task, and
therefore the scoring method, is rather different.
For the work reported in this paper we have extended Carroll et al.’s basic system,
implementing a version of Schmid and Rooth’s expected governor technique (see section
1 above) but adapted for unification-based grammar and GR-based analyses. Each
sentence is analysed as a set of weighted GRs where the weight associated with each
grammatical relation is computed as the sum of the probabilities of the parses that
relation was derived from, divided by the sum of the probabilities of all parses. So,
if we assume that Schmid and Rooth’s example sentence Peter reads every paper on
markup has 2 parses, one where on markup attaches to the preceding noun having
overall probability 0.007 and the other where it has verbal attachment with
probability 0.003, then some of the weighted GRs would be

1.0 ncsubj(reads, Peter, )
0.7 ncmod(on, paper, markup)
0.3 ncmod(on, reads, markup)

[2] We use the $F_1$ measure defined as
$2 \times \mathit{precision} \times \mathit{recall} / (\mathit{precision} + \mathit{recall})$.
[3] Our calculation, based on table 2 of Lin (1998).

Figure 1 contains a more extended example of a weighted GR analysis for a short
sentence from the SUSANNE corpus, and also gives a flavour of the relation types that
the system returns. The GR scheme is described in detail by Carroll, Briscoe and
Sanfilippo (1998).
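The weighting scheme described in this section can be sketched in a few lines of Python. This is only a minimal illustration, not the system's implementation: the function name weighted_grs, the representation of a parse as a (probability, GRs) pair and of a GR as a plain tuple are assumptions, and the two parse probabilities are illustrative values chosen to be consistent with the weights 0.7 and 0.3 in the Peter reads every paper on markup example.

```python
from collections import defaultdict


def weighted_grs(parses):
    """Map each GR to (probability mass of parses containing it) / (total mass).

    `parses` is a list of (probability, iterable_of_GR_tuples) pairs; this
    input format is a hypothetical one chosen for the sketch.
    """
    total = sum(prob for prob, _ in parses)
    mass = defaultdict(float)
    for prob, grs in parses:
        for gr in set(grs):  # a GR counts once per parse it occurs in
            mass[gr] += prob
    return {gr: m / total for gr, m in mass.items()}


# Two parses of "Peter reads every paper on markup".
# The probabilities are assumed values that yield weights of 0.7 and 0.3.
noun_attachment = (0.007, [("ncsubj", "reads", "Peter"),
                           ("ncmod", "on", "paper", "markup")])
verb_attachment = (0.003, [("ncsubj", "reads", "Peter"),
                           ("ncmod", "on", "reads", "markup")])

weights = weighted_grs([noun_attachment, verb_attachment])
for gr, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{weight:.1f} {gr}")
```

A GR that occurs in every parse receives weight one regardless of how the probability mass is distributed, which is why the subject relation above is unaffected by the attachment ambiguity.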
3 Empirical Results
3.1 Weight Thresholding
Our first experiment compared the accuracy of the parser when extracting GRs from the
highest ranked analysis (the standard probabilistic parsing setup) against extracting
weighted GRs from all parses in the forest. To measure accuracy we use the precision,
recall and F-score measures of parser GRs against ‘gold standard’ GR annotations in a
10,000-word test corpus of in-coverage sentences derived from the SUSANNE corpus and
covering a range of written genres [4]. GRs are in general compared using an equality
test, except that in a specific, limited number of cases (described by Carroll,
Minnen and Briscoe, 1998) the parser is allowed to return more generic relation
types.
When a parser GR has a weight of less than one, we proportionally discount its
contribution to the precision and recall scores. Thus, given a set $T$ of GRs with
associated weights produced by the parser, i.e.

$$T = \{(w_i, r_i) \mid w_i \text{ is the weight associated with GR } r_i, \text{ where } 0 < w_i \le 1\}$$

and a set $S$ of gold-standard (unweighted) GRs, we compute the weighted match
between $S$ and the elements of $T$ as

$$m = \sum_{(w_i, r_i) \in T} w_i \, \delta(r_i \in S)$$

where $\delta(x) = 1$ if $x$ is true and $0$ otherwise. The weighted precision and
recall are then

$$\frac{m}{\sum_{(w_i, r_i) \in T} w_i} \quad \text{and} \quad \frac{m}{|S|}$$

respectively, expressed as percentages. We are not aware of any previous published
work using weighted precision and recall measures, although there is an option for
associating weights with complete parses in the distributed software implementing the
PARSEVAL scheme (Harrison et al., 1991) for evaluating parser accuracy with respect
to phrase structure bracketings. The weighted measures make sense for application
tasks that can deal with sets of mutually-inconsistent GRs.

Table 1: GR accuracy when extracting from just the highest-ranked parse, compared
with weighted GR extraction from all parses.

              Precision (%)   Recall (%)   F-score
Best parse    76.25           76.77        76.51
All parses    74.63           75.33        74.98

[4] At http://www.cogs.susx.ac.uk/lab/nlp/carroll/greval.html.

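As a concrete reading of these definitions, the following Python sketch computes the weighted match and the weighted precision, recall and F-score for one set of parser GRs. It is an illustration under stated assumptions: the function name weighted_scores, the dictionary representation of the parser output and the toy GR names are hypothetical, and the F-score is the usual harmonic mean of precision and recall.

```python
def weighted_scores(parser_grs, gold):
    """Weighted precision, recall (both in %) and F-score.

    `parser_grs` maps each GR returned by the parser to its weight in (0, 1];
    `gold` is the set of gold-standard (unweighted) GRs.
    """
    # Weighted match: each correct GR contributes its weight, not 1.
    match = sum(w for gr, w in parser_grs.items() if gr in gold)
    returned = sum(parser_grs.values())
    precision = 100.0 * match / returned if returned else 0.0
    recall = 100.0 * match / len(gold) if gold else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score


# Toy data (not from the test corpus): gr_a and gr_b are correct, gr_c is not.
parser_grs = {"gr_a": 1.0, "gr_b": 0.7, "gr_c": 0.3}
gold = {"gr_a", "gr_b"}
precision, recall, f_score = weighted_scores(parser_grs, gold)
print(precision, recall, f_score)
```

In the toy data the correct relation gr_b earns only 0.7 of a match because its weight is below one, which is the discounting effect that lowers the all-parses scores relative to the best-parse scores.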
In this initial experiment, precision and recall when extracting weighted GRs from
all parses were both about one and a half percentage points lower than when GRs were
extracted from just the highest ranked analysis (see table 1) [5]. This decrease in
accuracy might be expected, though, given that a true positive GR may be returned
with weight less than one, and so will not receive full credit from the weighted
precision and recall measures.
However, these results only tell part of the story.
An application using grammatical relation analyses
might be interested only in GRs that the parser is
fairly confident of being correct. For instance, in un-
supervised acquisition of lexical information (such
as subcategorisation frames for verbs) from text, the
usual methodology is to (partially) analyse the text,
retaining only reliable hypotheses which are then
filtered based on the amount of evidence for them
over the corpus as a whole. Thus, Brent (1993)
only creates hypotheses on the basis of instances
of verb frames that are reliably and unambiguously
cued by closed class items (such as pronouns) so
there can be no other attachment possibilities. In re-
cent work on unsupervised learning of prepositional
phrase disambiguation, Pantel and Lin (2000) derive training instances only from
relevant data appearing in syntactic contexts that are guaranteed to be unambiguous.
In our system, the weights on GRs indicate how certain the parser is of the
associated relations being correct. We therefore investigated whether more highly
weighted GRs are in fact more likely to be correct than ones with lower weights. We
did this by setting a threshold on the output, such that any GR with weight lower
than the threshold is discarded.

[5] Ignoring the weights on GRs, standard (unweighted) evaluation results for all
parses are: precision 36.65%, recall 89.42% and F-score 51.99.

1.0    aux( , continue, will)
1.0    detmod( , burden, a)
1.0    dobj(do, this, )
1.0    dobj(place, burden, )
1.0    ncmod( , burden, disproportionate)
1.0    ncsubj(continue, Failure, )
1.0    ncsubj(place, Failure, )
1.0    xcomp(to, Failure, do)
0.9730 clausal(continue, place)
0.9673 ncmod( , tax-payers, Fulton)
0.4490 iobj(on, place, tax-payers)
0.3276 ncmod(on, burden, tax-payers)
0.2138 ncmod(on, place, tax-payers)
0.0250 xmod(to, continue, place)
0.0242 ncmod( , Fulton, tax-payers)
0.0086 obj2(place, tax-payers)
0.0086 ncmod(on, burden, Fulton)
0.0020 mod( , continue, place)
0.0010 ncmod(on, continue, tax-payers)
Figure 1: Weighted GRs for the sentence Failure to do this will continue to place a
disproportionate burden on Fulton taxpayers.

[Figure 2 plot: recall (%) on the vertical axis against precision (%) on the
horizontal axis, each running from 50 to 100, with points on the curve labelled
Threshold=0 and Threshold=1.]
Figure 2: Weighted GR accuracy as the threshold is varied.

Figure 2 plots weighted recall and precision as the threshold is varied between zero
and one. The results are intriguing. Precision increases monotonically from 74.6% at
a threshold of zero (the situation as in the previous experiment where all GRs
extracted from all parses in the forest are returned) to 90.4% at a threshold of one.
(The latter threshold has the effect of allowing only those GRs that form part of
every single analysis to be returned.) The influence of the threshold on recall is
equally dramatic, although since we have not escaped the usual trade-off with
precision the results are somewhat less positive. Recall decreases from 75.3% to
45.2%, initially rising slightly, then falling at a gradually increasing rate.
Between thresholds 0.99 and 1.0 there is only a two percentage point difference in
precision, but recall differs by almost fourteen percentage points [6]. Over the
whole range, as the threshold is increased from zero, precision rises faster than
recall falls until the threshold reaches 0.65; here the F-score attains its overall
maximum of 77.
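The thresholding experiment can be illustrated with a short Python sketch that discards low-weight GRs and recomputes weighted precision and recall at each threshold. The function names apply_threshold and sweep, the toy weights and the small tolerance eps (added so that weights that are one up to floating-point rounding survive a threshold of one) are assumptions of this sketch, not details taken from the system.

```python
def apply_threshold(weighted, threshold, eps=1e-9):
    """Discard every GR whose weight is lower than the threshold."""
    return {gr: w for gr, w in weighted.items() if w >= threshold - eps}


def sweep(weighted, gold, thresholds):
    """Return (threshold, weighted precision %, weighted recall %) rows."""
    rows = []
    for t in thresholds:
        kept = apply_threshold(weighted, t)
        match = sum(w for gr, w in kept.items() if gr in gold)
        returned = sum(kept.values())
        precision = 100.0 * match / returned if returned else 0.0
        recall = 100.0 * match / len(gold) if gold else 0.0
        rows.append((t, precision, recall))
    return rows


# Toy data (not from the test corpus): gr_a and gr_b are correct, gr_c is not.
weighted = {"gr_a": 1.0, "gr_b": 0.7, "gr_c": 0.3}
gold = {"gr_a", "gr_b"}
rows = sweep(weighted, gold, [0.0, 0.5, 1.0])
for threshold, precision, recall in rows:
    print(f"threshold={threshold:.2f} precision={precision:.1f} recall={recall:.1f}")
```

Even on three toy relations the pattern of the experiment is visible: raising the threshold removes the incorrect low-weight relation first, and at a threshold of one a correct relation is also lost, so recall drops while precision stays high.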
It turns out that the eventual figure of over 90% precision is not due to ‘easier’
relation types (such as the dependency between a determiner and a noun) being
returned and more difficult ones (for example clausal complements) being ignored. The
majority of relation types are produced with frequency consistent with the overall
45% recall figure. Exceptions are arg_mod (encoding the English passive ‘by-phrase’)
and iobj (indirect object), for which no GRs at all are produced. The reason for this
is that both types of relation originate from an occurrence of a prepositional phrase
in contexts where it could be either a modifier or a complement of a predicate. This
pervasive ambiguity means that there will always be disagreement between analyses
over the relation type (but not necessarily over the identity of the head and
dependent themselves).
3.2 Parse Unpacking
Schmid and Rooth’s algorithm computes expected governors efficiently by using dynamic
programming and processing the entire parse forest rather than individual trees. In
contrast, we unpack the whole parse forest and then extract weighted GRs from each
tree individually. Our implementation is certainly less elegant, but in practical
terms for sentences where there are relatively small numbers of parses the speed is
still acceptable. However, throughput goes down linearly with the number of parses,
and when there are many thousands of parses (and particularly when the sentence is
also long, so that each tree is large) the parsing system becomes unacceptably slow.

[6] Roughly, each percentage point increase or decrease in precision and recall is
statistically significant at the 95% level. In this and all significance tests in
this paper we use a one-tailed paired t-test (with 499 degrees of freedom).

One way to improve the situation would be to extract GRs directly from forests. At
first glance this looks feasible: although our parse forests are produced by a
probabilistic LR parser using a unification-based grammar, they are similar in
content to those computed by a probabilistic context-free grammar, as assumed by
Schmid and Rooth’s algorithm. However, there are problems. If the test for being able
to pack local ambiguities in the unification grammar parse forest is feature
structure subsumption, unpacking a parse apparently encoded in the forest can fail
due to non-local inconsistency in feature values (Oepen and Carroll, 2000) [7], so
every governor tuple hypothesis would have to be checked to ensure that the parse it
came from was globally valid. It is likely that this verification step would cancel
out the efficiency gained from using an algorithm based on dynamic programming. This
problem could be side-stepped (but at the cost of less compact parse forests) by
testing for feature structure equivalence rather than subsumption. A second, more
serious problem is that some of our relation types encode more information than is
present in a single governor tuple (the non-clausal subject relation, for instance,
encoding whether the surface subject is the ‘deep’ object in a passive construction);
this information can again be less local and violate the conditions required for the
dynamic programming approach.
Another possibility is to compute only the $n$ highest ranked parses and extract
weighted GRs from just those. The basic case where $n = 1$ is equivalent to the
standard approach of computing GRs from the highest probability parse. Table 2 shows
the effect on accuracy as $n$ is increased in stages to 1000, using a threshold for
GR extraction of 1; also shown is the previous setup (labelled ‘unlimited’) in which
all parses in the forest are considered. [8] (All differences in precision in the
table are significant to at least the 95% level, except between 1000 parses and an
unlimited number.) The results demonstrate that limiting processing to a relatively
small, fixed number of parses (even as low as 100) comes within a small margin of the
accuracy achieved using the full parse forest. These results are striking, in view of
the fact that the grammar assigns more than 300 parses to over a third of the
sentences in the test corpus, and more than a thousand parses to a fifth of them.
Another interesting observation is that the relationship between precision and recall
is very close to that seen when the threshold is varied (as in the previous section);
there appears to be no loss in recall at a given level of precision. We therefore
feel confident in unpacking a limited number of parses from the forest and extracting
weighted GRs from them, rather than trying to process all parses. We have tentatively
set the limit to be 1000, as a reasonable compromise in our system between throughput
and accuracy.

Table 2: Weighted GR accuracy using a threshold of 1, with respect to the maximum
number of ranked parses considered.

Maximum parses   Precision (%)   Recall (%)   F-score
1                76.25           76.77        76.51
2                80.15           73.30        76.57
5                84.94           67.03        74.93
10               86.73           62.47        72.63
100              89.59           51.45        65.36
1000             90.24           46.08        61.00
unlimited        90.40           45.21        60.27

[7] The forest therefore also ‘leaks’ probability mass since it contains derivations
that are in fact not legal.
[8] At $n = 1000$ parses, the (unlabelled) weighted precision of head-dependent pairs
is 91.0%.

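The n-best variant can be sketched as follows in Python. The sketch assumes that weights are normalised by the probability mass of the n parses actually considered (the text does not spell this out), and the names n_best_weighted_grs and certain_grs, the input format and the toy parses are all hypothetical.

```python
import heapq
from collections import defaultdict


def n_best_weighted_grs(parses, n):
    """Weight GRs using only the n highest-probability parses.

    `parses` is a list of (probability, set_of_GRs) pairs. Normalising by the
    mass of the n parses considered is an assumption of this sketch.
    """
    top = heapq.nlargest(n, parses, key=lambda parse: parse[0])
    total = sum(prob for prob, _ in top)
    mass = defaultdict(float)
    for prob, grs in top:
        for gr in set(grs):
            mass[gr] += prob
    return {gr: m / total for gr, m in mass.items()}


def certain_grs(parses, n, eps=1e-9):
    """GRs with weight one, i.e. those present in every one of the n parses."""
    weights = n_best_weighted_grs(parses, n)
    return {gr for gr, w in weights.items() if w >= 1.0 - eps}


# Toy forest with three parses (illustrative probabilities and GR names).
parses = [(0.5, {"gr_a", "gr_b", "gr_c"}),
          (0.3, {"gr_a", "gr_b", "gr_d"}),
          (0.2, {"gr_a", "gr_e"})]
for n in (1, 2, 3):
    print(n, sorted(certain_grs(parses, n)))
```

With a threshold of one, each additional parse can only remove relations from the returned set, which matches the direction of the recall column in Table 2 as the maximum number of parses grows.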
3.3 Parse Weighting
The way in which the GR weighting is carried out does not matter when the weight
threshold is equal to 1 (since then only GRs that are part of every analysis are
returned, each with a weight of one). However, we wanted to see whether the precise
method for assigning weights to GRs has an effect on accuracy, and if so, to what
extent. We therefore tried an alternative approach where each GR receives a
contribution of 1 from every parse it occurs in, no matter what the probability of
the parse is, normalising in this case by the number of parses considered. This tends
to increase the numbers of GRs returned for any given threshold, so when comparing
the two methods we found thresholds such that each method obtained the same precision
figure (of roughly 88.4%). We then compared the recall figures (see table 3). The
recall for the probabilistic weighting scheme is 4 percentage points higher
(statistically significant at the 99.95% level).

Table 3: Accuracy at the same level of precision using different weighting methods,
with a 1000-parse tree limit.

Weighting method                     Precision (%)   Recall (%)   F-score
Probabilistic (at threshold 0.99)    88.38           59.19        70.90
Equally (at threshold 0.768)         88.39           55.17        67.94

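The two weighting methods differ only in what each parse contributes, as the following Python sketch shows. The function name gr_weights, its probabilistic flag, the input format and the toy parses are assumptions made for illustration.

```python
from collections import defaultdict


def gr_weights(parses, probabilistic=True):
    """Weight GRs by parse probability, or equally (1 per parse) if not.

    `parses` is a list of (probability, set_of_GRs) pairs. With equal
    weighting the normaliser is simply the number of parses considered.
    """
    contributions = [(prob if probabilistic else 1.0, grs)
                     for prob, grs in parses]
    total = sum(c for c, _ in contributions)
    mass = defaultdict(float)
    for c, grs in contributions:
        for gr in set(grs):
            mass[gr] += c
    return {gr: m / total for gr, m in mass.items()}


# Toy forest (illustrative values): one dominant parse and two unlikely ones.
parses = [(0.90, {"gr_a", "gr_b"}),
          (0.06, {"gr_a", "gr_c"}),
          (0.04, {"gr_a", "gr_c"})]
by_probability = gr_weights(parses, probabilistic=True)
by_count = gr_weights(parses, probabilistic=False)
print(by_probability)
print(by_count)
```

In this toy forest the relation gr_c, which is supported only by the two unlikely parses, has weight 0.1 under probabilistic weighting but two thirds under equal weighting. A relation present in every parse has weight one under both methods, consistent with the observation that the method is irrelevant at a threshold of 1.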
3.4 Maximal Consistent Relation Sets
It is interesting to see what happens if we compute for each sentence the maximal
consistent set of weighted GRs. (We might want to do this if we want complete and
coherent sentence analyses, interpreting the weights as confidence measures over
sub-analysis segments.) We use a ‘greedy’ algorithm to compute consistent relation
sets, taking GRs sorted in order of decreasing weight and adding a GR to the set if
and only if there is not already a GR in the set with the same dependent. (But note
that the correct analysis may in fact contain more than one GR with the same
dependent, such as the ncsubj ... Failure GRs in Figure 1, and in these cases this
method will introduce errors.) The weighted precision, recall and F-score at
threshold zero are 79.31%, 73.56% and 76.33 respectively. Precision and F-score are
significantly better (at the 95.95% level) than the baseline.
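The greedy procedure can be sketched in Python as below. The GR record with named relation, head and dependent fields is a simplified, hypothetical representation (the actual relations have type-specific slots, and a real implementation would identify words by position rather than by string); ties between equal weights are broken by input order, which is also an assumption. The example weights are taken from Figure 1.

```python
from typing import NamedTuple


class GR(NamedTuple):
    """Simplified GR record; field names are assumptions of this sketch."""
    relation: str
    head: str
    dependent: str


def greedy_consistent_set(weighted):
    """Take GRs in order of decreasing weight, at most one per dependent."""
    chosen = {}
    used_dependents = set()
    ranked = sorted(weighted.items(), key=lambda item: item[1], reverse=True)
    for gr, weight in ranked:
        if gr.dependent not in used_dependents:
            chosen[gr] = weight
            used_dependents.add(gr.dependent)
    return chosen


# A subset of the weighted GRs in Figure 1.
weighted = {
    GR("ncsubj", "continue", "Failure"): 1.0,
    GR("ncsubj", "place", "Failure"): 1.0,
    GR("dobj", "place", "burden"): 1.0,
    GR("iobj", "place", "tax-payers"): 0.4490,
    GR("ncmod", "burden", "tax-payers"): 0.3276,
    GR("ncmod", "place", "tax-payers"): 0.2138,
}
consistent = greedy_consistent_set(weighted)
for gr, weight in consistent.items():
    print(weight, gr)
```

Only the highest-weighted of the three competing attachments of tax-payers survives, and the second ncsubj relation for Failure is dropped even though it has weight one, which is the kind of error noted in the text.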
3.5 Parser Bootstrapping
One of our primary research goals is to explore unsupervised acquisition of lexical
knowledge. The parser we use in this work is ‘semi-lexicalised’, using
subcategorisation probabilities for verbs acquired automatically from (unlexicalised)
parses. In the future we intend to acquire other types of lexico-statistical
information (for example on PP attachment) which we will feed back into the parser’s
disambiguation procedure, bootstrapping successively more accurate versions of the
parsing system. There is still plenty of scope for improvement in accuracy, since
compared with the number of correct GRs in top-ranked parses there are roughly a
further 20% that are correct but present only in lower-ranked parses. There appears
to be less room for improvement with argument relations (ncsubj, dobj etc.) than with
modifier relations (ncmod and similar). This indicates that our next efforts should
be directed to collecting information on modification.
4 Discussion and Further Work
We have extended a shallow parsing system for En-
glish that returns analyses in the form of sets of
grammatical relations, presenting an investigation
into the extraction of weighted relations from prob-
abilistic parses. We observed that setting a thresh-
old on the output such that any relation with weight
lower than the threshold is discarded allows a trade-
off to be made between recall and precision, and
found that by setting the threshold at 1 the preci-
sion of the system was boosted dramatically, from
a baseline of 75% to over 90%. With this setting,
the system returns only relations that form part of
all analyses licensed by the grammar: the system
can have no greater certainty that these relations are
correct, given the knowledge that is available to it.
Although we believe this technique to be well suited to probabilistic parsers, it
could also potentially benefit any parsing system that can represent ambiguity and
return analyses that are composed of a collection of elementary units. Such a system
need not necessarily be statistical, since parse probabilities make no difference
when checking that a given sub-analysis segment forms part of all possible global
analyses. Moreover, a non-statistical parsing system could use the technique to
construct a reliable annotated corpus automatically, which it could then be trained
on.
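For a parser without probabilities, the check reduces to a set intersection over the analyses, as in this minimal Python sketch. The function name relations_in_all_analyses and the representation of an analysis as a set of elementary units are assumptions made for illustration.

```python
def relations_in_all_analyses(analyses):
    """Return the elementary units shared by every analysis of a sentence.

    `analyses` is an iterable of collections of hashable units (for example
    GR tuples); no probabilities are needed.
    """
    unit_sets = [set(analysis) for analysis in analyses]
    if not unit_sets:
        return set()
    return set.intersection(*unit_sets)


# Toy example: two analyses that differ only in the attachment of a modifier.
analyses = [
    {("ncsubj", "reads", "Peter"), ("ncmod", "on", "paper", "markup")},
    {("ncsubj", "reads", "Peter"), ("ncmod", "on", "reads", "markup")},
]
shared = relations_in_all_analyses(analyses)
print(shared)
```

The result is the same set that a probabilistic parser would return at a threshold of one, since a unit has weight one exactly when it occurs in every analysis.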
Acknowledgements
We are grateful to Mats Rooth for early discussions about his expected governor label
work. This research was supported by UK EPSRC project GR/N36462/93 ‘Robust Accurate
Statistical Parsing (RASP)’ and by EU FP5 project IST-2001-34460 ‘MEANING: Developing
Multilingual Web-scale Language Technologies’.

References

Aït-Mokhtar, S. and J-P. Chanod (1997) Subject and object dependency extraction
using finite-state transducers. In Proceedings of the ACL/EACL’97 Workshop on
Automatic Information Extraction and Building of Lexical Semantic Resources, 71-77.
Madrid, Spain.

Argamon, S., I. Dagan and Y. Krymolowski (1998) A
memory-based approach to learning shallow natural
language patterns. In Proceedings of the 36th An-
nual Meeting of the Association for Computational
Linguistics, 67-73. Montreal.

Blaheta, D. and E. Charniak (2000) Assigning function
tags to parsed text. In Proceedings of the 1st Con-
ference of the North American Chapter of the ACL,
234-240. Seattle, WA.

Bouma, G., G. van Noord and R. Malouf (2001)
Alpino: wide-coverage computational analysis of
Dutch. Computational Linguistics in the Netherlands
2000. Selected Papers from the 11th CLIN Meeting.

Brants, T., W. Skut and B. Krenn (1997) Tagging gram-
matical functions. In Proceedings of the 2nd Confer-
ence on Empirical Methods in Natural Language Pro-
cessing, 64-74. Providence, RI.

Brent, M. (1993) From grammar to lexicon: unsuper-
vised learning of lexical syntax. Computational Lin-
guistics, 19(3), 243-262.

Briscoe, E. and J. Carroll (1997) Automatic extraction
of subcategorization from corpora. In Proceedings of
the 5th ACL Conference on Applied Natural Language
Processing, 356-363. Washington, DC.

Buchholz, S., J. Veenstra and W. Daelemans (1999) Cascaded grammatical relation
assignment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in
Natural Language Processing and Very Large Corpora, 239-246. College Park, MD.

Carroll, J., E. Briscoe and A. Sanfilippo (1998) Parser
evaluation: a survey and a new proposal. In Proceed-
ings of the 1st International Conference on Language
Resources and Evaluation, 447-454. Granada, Spain.

Carroll, J., G. Minnen and E. Briscoe (1998) Can subcategorisation probabilities
help a statistical parser? In Proceedings of the 6th ACL/SIGDAT Workshop on Very
Large Corpora. Montreal, Canada.

Clark, S. and D. Weir (2001) Class-based probability es-
timation using a semantic hierarchy. In Proceedings
of the 2nd Conference of the North American Chapter
of the ACL. Pittsburgh, PA.

Collins, M. (1999) Head-driven statistical models for
natural language parsing. PhD thesis, University of
Pennsylvania.

Grefenstette, G. (1997) SQLET: Short query linguistic expansion techniques,
palliating one-word queries by providing intermediate structure to text. In
Proceedings of RIAO’97, 500-509. Montreal, Canada.

Grefenstette, G. (1998) Light parsing as finite-state filtering. In A. Kornai (Ed.),
Extended Finite State Models of Language. Cambridge University Press.

Harrison, P., S. Abney, E. Black, D. Flickinger, C. Gdaniec, R. Grishman, D. Hindle,
B. Ingria, M. Marcus, B. Santorini and T. Strzalkowski (1991) Evaluating syntax
performance of parser/grammars of English. In Proceedings of the ACL’91 Workshop on
Evaluating Natural Language Processing Systems, 71-78. Berkeley, CA.

Karlsson, F., A. Voutilainen, J. Heikkilä and A. Anttila (1995) Constraint Grammar:
a Language-Independent System for Parsing Unrestricted Text. Berlin, Germany: de
Gruyter.

Kiefer, B., H-U. Krieger, J. Carroll and R. Malouf (1999)
A bag of useful techniques for efficient and robust
parsing. In Proceedings of the 37th Annual Meeting of
the Association for Computational Linguistics, 473-
480. University of Maryland.

Lafferty, J., D. Sleator and D. Temperley (1992) Gram-
matical trigrams: A probabilistic model of link gram-
mar. In Proceedings of the AAAI Fall Symposium on
Probabilistic Approaches to Natural Language, 89-
97. Cambridge, MA.

Leech, G. (1992) 100 million words of English: the
British National Corpus. Language Research, 28(1),
1-13.

Lin, D. (1998) Dependency-based evaluation of MINIPAR. In Proceedings of The
Evaluation of Parsing Systems: Workshop at the 1st International Conference on
Language Resources and Evaluation. Granada, Spain (also available as University of
Sussex technical report CSRP-489).

Lin, D. (1999) Automatic identification of non-
compositional phrases. In Proceedings of the 37th
Annual Meeting of the Association for Computational
Linguistics, 317-324. College Park, MD.

McCarthy, D. (2000) Using semantic preferences to
identify verbal participation in role switching alter-
nations. In Proceedings of the 1st Conference of the
North American Chapter of the ACL, 256-263. Seat-
tle, WA.

Oepen, S. and J. Carroll (2000) Ambiguity packing in constraint-based parsing:
practical results. In Proceedings of the 1st Conference of the North American
Chapter of the ACL, 162-169. Seattle, WA.

Palmer, M., R. Passonneau, C. Weir and T. Finin (1993)
The KERNEL text understanding system. Artificial
Intelligence, 63, 17-68.

Pantel, P. and D. Lin (2000) An unsupervised approach
to prepositional phrase attachment using contextually
similar words. In Proceedings of the 38th Annual
Meeting of the Association for Computational Lin-
guistics, 101-108. Hong Kong.

Sampson, G. (1995) English for the Computer. Oxford
University Press.

Schmid, H. and M. Rooth (2001) Parse forest computa-
tion of expected governors. In Proceedings of the 39th
Annual Meeting of the Association for Computational
Linguistics, 458-465. Toulouse, France.

Srinivas, B. (2000) A lightweight dependency analyzer
for partial parsing. Natural Language Engineering,
6(2), 113-138.

Yeh, A. (2000) Using existing systems to supplement
small amounts of annotated grammatical relations
training data. In Proceedings of the 38th Annual Meet-
ing of the Association for Computational Linguistics,
126-132. Hong Kong.
