What grammars tell us about corpora: 
the case of reduced relative clauses 
Paola Merlo 
LATL-University of Geneva 
IRCS 
University of Pennsylvania 
3401 Walnut St Suite 400A 
Philadelphia PA 19104-6228 
U.S.A. 
merlo@linc.cis.upenn.edu
Abstract 
We present a large (65 million words of Wall Street Journal) and
in-depth corpus study of a particular syntactic ambiguity to
investigate (1) to what extent the structure of a grammar is
reflected in a corpus, and (2) how probability functions defined
according to a grammar fit independently established measures of
syntactic disambiguation preference. We look at the well-known case
of the ambiguity between a main clause and a reduced relative
construction. We measure the probability distributions of several
linguistic features (transitivity, tense, voice) over a sample of
optionally intransitive verbs. In agreement with recent results on
parsing with lexicalised probabilistic grammars (Collins, 1997;
Srinivas, 1997), we find that statistics over lexical, as opposed to
structural, features best correspond to human intuitive judgments
and to experimental findings. These results are useful for
investigating novel uses of corpora, by assessing the portability of
statistics across tasks, and by determining what is needed for
useful syntactic annotation of corpora.
1 Introduction 
Most linguistic work until the 1950s studied language use, which
required attention to detail and exceptions, and led to the
development of data-driven theories and to the use of corpora to
model naturally occurring language. Later on, linguists mostly
studied grammars, which focussed on generalities and regularities,
and led to the formulation of strong theories and to the study of
similarity across languages. Some of the current "empirical"
approaches integrate the corpus-based lessons with the depth of
insight that the study of grammar has brought to the study of
language.
Suzanne Stevenson
Dept of Computer Science
and Center for Cognitive Science (RuCCS)
Rutgers University
CoRE Building, Busch Campus
New Brunswick, NJ 08903
U.S.A.
suzanne@ruccs.rutgers.edu
Empirically-induced models that learn a linguistically meaningful
grammar (Collins, 1997) seem to give the best practical results in
statistical natural language processing. One of the reasons why
these models perform so well compared to probabilistic context-free
grammars is that they incorporate detailed lexical knowledge at all
points in the derivation (Charniak, 1997). At the same time they
perform better than string-based approaches because they retain
structural knowledge, such as phrase structure, subcategorization
and long distance dependencies. So they are equally capable of
modelling the fine lexical idiosyncrasies and the more general
syntactic regularities.
Given an annotated training corpus, such methods learn its
distributions (the lexical co-occurrences), which requires being
given the correct space of events in the model (that is, the
grammar) accurately enough that they can parse new instances of the
same corpus. The success of such models suggests that a statistical
model must have access to the appropriate linguistic features to
make accurate predictions.
We might want to ask the question: what happens if what one wants to
do with annotated text is not to annotate more text, but to perform
some other task? Are the same insights valid, so that annotated text
can be used to help in other tasks, for instance generation or
translation? Can we use annotated text to investigate properties of
language(s) systematically? In other words, can we use annotated
text as a repository of information? The answer is a qualified yes.
In this paper we look at one type of information that is plentifully
present in a corpus, namely syntactic preferences, and we argue that
corpora can be very useful even for tasks that do not involve
parsing directly, but that making corpora useful for other tasks
might require more a priori information than expected. Precisely, we
ask the following question: are the percentages of occurrence of
linguistically-defined units in a large corpus in accord with what
is known about preferences for these units collected in other ways,
such as unedited sentence production, experimental findings, or
intuitive native speakers' judgments?
This question is relevant as there is evidence in the literature of
human parsing preferences that is in apparent disagreement with
predictions of preferences derived from frequencies in a corpus
(Brysbaert et al., 1998). Besides the interest in modelling human
performance (which is, however, not the focus of the current paper),
it is important to investigate the sources of this disagreement
between production preference data (frequencies in a text) and
perception data (parsing preferences by humans), if the plentiful
information stored in text is to be used successfully.
Distributional properties of texts, if understood, can be used to
approximate resolution of ambiguity in several tasks which involve
deeper natural language understanding: a generation system can use
distributional properties to reproduce users' preference data;
automatic translation can use monolingual distributions to model
cross-linguistic variation accurately; and automatic lexical
acquisition can use distributional properties of text to bootstrap a
process of organisation of lexical information.
The method we use to address the question is as follows. We present
a large in-depth corpus-based case study (65 million words of WSJ)
to investigate (1) how the structure of a grammar is reflected in a
corpus, and (2) how probability functions defined according to a
grammar fit native speakers' linguistic behaviour in syntactic
disambiguation. We look at the well-known case of the ambiguity
between a main clause and a reduced relative construction, which
arises because regular verbs in English present an ambiguity between
the simple past and the past participle (the -ed form). We measure
the probability distributions of several linguistic features
(transitivity, tense, voice) over a sample of optionally
intransitive verbs. We do this by hypothesizing and testing several
probability functions over the sample. In agreement with recent
results on parsing with lexicalised probabilistic grammars (Collins,
1997; Srinivas, 1997; Charniak, 1997), our main result is that
statistics over lexical features best correspond to independently
established human intuitive preferences and experimental findings.
We discuss several consequences. Methodologically, this result casts
light on the relationship between different ways of collecting
preference information. It shows that some apparently contradictory
results that have been discussed in the literature can be
reconciled. The crucial factor is the level of specificity one looks
at. Theoretically, not all lexical features are equally good
predictors of linguistic behaviour, and they vary in their ability
to correctly classify linguistic phenomena. Finally, from the point
of view of language engineering, this result provides a strong
indication of what units might port better across tasks, and what
features would be most useful in a syntactically annotated corpus.
2 Reduced Relative Clauses 
2.1 Linguistic Properties 
The following classic "garden-path" example demonstrates the severe
processing difficulty that can be associated with the main
verb/reduced relative (MV/RR) ambiguity (Bever, 1970):
(1) The horse raced past the barn fell. 
Problems arise here because the verb raced can be interpreted as
either a past tense main verb, or as a past participle within a
reduced relative clause (i.e., the horse [that was] raced past the
barn). Because fell is the main verb of (1), the reduced relative
interpretation of raced is required for a coherent analysis of the
complete sentence. But the main verb interpretation of raced is so
strongly preferred that the human language processor breaks down at
the verb fell, unable to integrate it with the interpretation that
has been developed to that point.
This construction is representative of the problem we want to
address. It is very frequent (MacDonald et al., 1994), hence it
constitutes a problem that is relevant for any application. It is
both lexically and structurally ambiguous, so it constitutes a hard
problem. It is well-studied: there are plentiful data on human
processing and their relation to frequency of the stimuli
(MacDonald, 1994; Trueswell, 1996; Trueswell et al., 1994).

VERB TYPE      EXAMPLE                                          JUDGMENT
unergative     The horse raced past the barn fell               hard
unaccusative   The butter melted in the pan was rancid          easy
object-drop    The player kicked in the soccer game was angry   easy
Table 1: Processing difficulty of different classes of optionally
intransitive verbs according to speakers' intuitions
Over the last several years, it has become clear that not all
reduced relatives are as difficult as sentence (1) above, and that
the difficulty in processing reduced relatives is directly linked to
the lexical items in the sentence. In particular the difficulty
appears to be related to the type of verb which is involved in the
ambiguity. For the ambiguity to arise, the verb involved (raced in
this case) must be optionally transitive. English has three types of
optionally transitive verbs, which differ both in their lexical
semantics and in their syntactic properties.
Sentence (1) uses a manner of motion verb, raced. In English, these
verbs form a subclass of unergative verbs (Levin and Rappaport
Hovav, 1995), intransitive action verbs that may appear in a
transitive form:
(2a) The horse raced past the barn. 
(2b) The rider raced the horse past the barn. 
The transitive form of an unergative (2b) is the causative
counterpart of the intransitive form (2a), in which the subject of
the intransitive becomes the object of the transitive (Hale and
Keyser, 1993; Levin and Rappaport Hovav, 1995). Sentences (3a) and
(3b) use an unaccusative verb, melt:
(3a) The butter melted in the pan. 
(3b) The cook melted the butter in the pan. 
Unaccusatives are intransitive change of state verbs which also have
a causative transitive form. They differ from unergatives because
their alternating theta role is a theme (butter), while for
unergatives it is an agent (horse). Finally, sentences (4a) and (4b)
use an object-drop verb, kicked; these verbs have a non-causative
transitive/intransitive alternation, in which the object NP is
optional:
(4a) The player kicked the referee. 
(4b) The player kicked. 
2.2 Processing Difficulty 
(Stevenson and Merlo, 1997) asked naive informants for acceptability
judgments on sentences with reduced relatives (RRs) containing these
verbs. They found that unergative verbs, such as raced or jumped,
uniformly led to a severe garden path in the RR construction, while
unaccusative verbs were overwhelmingly judged completely fine in the
RR, with a few responses of them being slightly degraded. They did
not ask for judgments on object-drop verbs; native speakers'
intuitions are that they are readily interpretable in a RR. Support
for this view comes from experiments which included object-drop
verbs, that showed that RRs are relatively easy to understand given
a context that is not strongly biased toward a main verb reading
(MacDonald, 1994). Thus, the difficulty of the RR interpretation
patterns along verb class lines, with unergatives difficult, and
unaccusatives and object-drop verbs relatively easy. We summarise
these results in Table 1.
2.3 Statistical Properties 
We measured the probability distributions of several linguistic
features (transitivity, tense, voice) over a sample of optionally
transitive verbs from the three lexical semantic classes described
above. We proceeded by hypothesizing and testing several probability
functions over the sample, and proposing an event classification
that best fits the native speaker judgments described above.
In our view, a grammar is a way of classifying elements in a
language. Our sample of language is a text, and our grammar is the
space of elementary events we define on the text. So our grammar is
the space of events over which we calculate the probability
distributions. The emphasis on lexicalised grammars, in linguistics,
sentence processing and statistical NLP alike, points towards
statistics computed at the level of lexical items or their
subfeatures.

                    RR     MV   Pass    Act  Trans   Intr    PRT     MV    ADJ  nonADJ
Unergatives          7   4910    139   5330    463   5065    647   4910     21     626
Unaccusatives       21   3321    717   3930   2402   2359   1476   3321    155    1321
Object-drops       202   2316   1339   3074   3355    922   1939   2316    176    1719
Table 2: Raw Counts
A probability space is a triple $(\Omega, \mathcal{F}, P)$, where
$\Omega$ is the sample space, $\mathcal{F}$ is the event space and
$P$ is a function $P : \mathcal{F} \rightarrow [0, 1]$. In the
discussion below, we assume 5 different probability spaces, in which
the event space is defined by sublexical properties of verbs.
First, we counted the occurrences of the verbs as a simple past main
verb (MV) and the occurrences of the verbs as a reduced relative
(RR). Second, we counted the occurrences of the verbs in a
transitive (TRANS) or intransitive (INTR) form. Third, we counted
the occurrences of the verbs in an active (ACT) or passive (PASS)
form. Then, we counted the occurrences of the verbs as a simple past
main verb (MV) and the occurrences of the verbs as a past participle
(PRT). These features were chosen because they minimally distinguish
main clause from reduced relative forms. Finally, we counted how
often the past participle form was used adjectivally. This last
count was chosen because only certain lexical semantic classes of
verbs (excluding unergative verbs) can occur as adjectives (Levin
and Rappaport 1986).
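Precisely, the five event spaces are
$\mathcal{F}_1 = \{MV, RR\}$, $\mathcal{F}_2 = \{TRANS, INTR\}$,
$\mathcal{F}_3 = \{PASS, ACT\}$, $\mathcal{F}_4 = \{MV, PRT\}$,
$\mathcal{F}_5 = \{NON\text{-}ADJ, ADJ\}$. In all cases, we assume
that the probabilities of the events are indicated by their relative
frequency.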
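
As an illustration of the relative-frequency assumption, the following minimal Python sketch estimates the probability of each event in the five event spaces from the aggregated counts in Table 2. The dictionary layout and the function name `relative_frequencies` are our own illustrative choices; note also that the analyses reported later are run on per-verb proportions, whereas this sketch works only from the class-level aggregates.

```python
# Raw aggregated counts copied from Table 2 (one entry per verb class).
RAW_COUNTS = {
    "unergative": {"RR": 7, "MV": 4910, "PASS": 139, "ACT": 5330,
                   "TRANS": 463, "INTR": 5065, "PRT": 647,
                   "ADJ": 21, "NON-ADJ": 626},
    "unaccusative": {"RR": 21, "MV": 3321, "PASS": 717, "ACT": 3930,
                     "TRANS": 2402, "INTR": 2359, "PRT": 1476,
                     "ADJ": 155, "NON-ADJ": 1321},
    "object-drop": {"RR": 202, "MV": 2316, "PASS": 1339, "ACT": 3074,
                    "TRANS": 3355, "INTR": 922, "PRT": 1939,
                    "ADJ": 176, "NON-ADJ": 1719},
}

# The five event spaces, each a pair of alternative events.
EVENT_SPACES = [("MV", "RR"), ("TRANS", "INTR"), ("PASS", "ACT"),
                ("MV", "PRT"), ("NON-ADJ", "ADJ")]


def relative_frequencies(counts, events):
    """Estimate P(event) within one event space by relative frequency."""
    total = sum(counts[event] for event in events)
    return {event: counts[event] / total for event in events}


if __name__ == "__main__":
    for verb_class, counts in RAW_COUNTS.items():
        for events in EVENT_SPACES:
            probs = relative_frequencies(counts, events)
            summary = ", ".join(f"P({e}) = {p:.3f}" for e, p in probs.items())
            print(f"{verb_class:13s} {summary}")
```

Each event space is normalised separately, so the two probabilities within a space sum to one while probabilities from different spaces are not directly comparable.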
We test the following hypothesis: 
H0: differences in processing pref- 
erences correspond to differences in 
the distributions of the measured vari- 
ables. 
2.3.1 Materials and Method 
We chose a set of 10 verbs from each class, based primarily on the
classification of verbs in (Levin, 1993): the unergatives are manner
of motion verbs (jumped, rushed, marched, leaped, floated, raced,
hurried, wandered, vaulted, paraded), the unaccusatives are verbs of
change of state (opened, exploded, flooded, dissolved, cracked,
hardened, boiled, melted, fractured, solidified), and the
object-drop verbs are unspecified object alternation verbs (played,
painted, kicked, carved, reaped, washed, danced, yelled, typed,
knitted). Each verb presented the same form in the simple past and
in the past participle, as in the MV/RR ambiguity. All verbs can
occur in the transitive, and in the passive. The verbs in the three
sets were matched pairwise in frequency, and their logarithmic
frequency varies between 2 and 4 inclusive.
In performing this kind of corpus analysis, one has to take into
account the fact that current corpus annotations do not distinguish
verb senses. The verbs in the materials were chosen because they did
not show massive departures from the intended verb sense: for
example, in a different study run was eliminated because it occurs
most often in phrases such as run a meeting, where it is not a
manner of motion use. However, in these counts, we did not
distinguish a core sense of the verb from an extended use of the
verb. So, for instance, the sentence Consumer spending jumped 1.7%
in February after a sharp drop the month before (from Wall Street
Journal 1987) is counted as an occurrence of the manner-of-motion
verb jump in its intransitive form. This is an assumption that is
likely to introduce more variance than if we had only counted core
senses of these verbs, but it is an unavoidable limitation at the
current state of annotation of corpora.
Counts were performed on the tagged version of the Brown Corpus and
on the portion of the Wall Street Journal distributed by the ACL/DCI
(years 1987, 1988, 1989), a combined corpus in excess of 65 million
words. Five pairs of counts were collected, for which the raw
aggregated results are shown in Table 2. First, each verb was
counted in its main verb (i.e., simple past) and past participle
uses, based on the part of speech tag of the verb in the corpora.
Second, active and passive uses of the verbs were counted: cases in
which usage could not be determined by a simple pattern search were
classified by hand. The third count also required manual
intervention: verbs were initially classified as transitive or
intransitive according to a set of regular search patterns, then
individual inspection of verbs was carried out to correct
item-specific errors. In the fourth count, uses of the verb form as
main verb or as reduced relative were collected. Reduced relatives
were counted by hand after extracting from the corpus all
occurrences of the past participle preceded by a noun. In the fifth
count, uses of the verbs as prenominal adjectives were counted. None
of the verb forms are explicitly marked as adjectives in these
corpora. To determine the counts of adjectival uses, we simply
divided the verb occurrences labelled with the past participle part
of speech tag into prenominal and other uses. The only unexpected
result we found was the occurrence of unergative adjectival forms.
On inspection all these forms occurred with two verbs: hurried
occurred 20 times, and rushed once. These were not the causative use
of the verb. So these verbs were removed from the analysis of
variance reported below. The unaccusatives and object-drops that
were matched in frequency to hurried and rushed were also removed
(unaccusatives: boiled, fractured; object-drop: danced, typed).

PROPERTY       VERB TYPE        RESULTS
main verb      unerg/unacc      F(18) =  4.058   p = 0.059
               unacc/obj-drop   F(18) =  1.498   p = 0.237
simple past    unerg/unacc      F(18) = 14.927   p = 0.001
               unacc/obj-drop   F(18) =  0.317   p = 0.580
active         unerg/unacc      F(18) =  9.578   p = 0.006
               unacc/obj-drop   F(18) =  0.067   p = 0.799
intransitive   unerg/unacc      F(18) =  7.487   p = 0.014
               unacc/obj-drop   F(18) =  2.514   p = 0.130
non-adj        unerg/unacc      F(14) = 13.283   p = 0.003
               unacc/obj-drop   F(14) =  0.311   p = 0.586
Table 3: Results of ANOVAs
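
The fifth count, which splits past-participle occurrences into prenominal and other uses, can be sketched roughly as follows. This is only an approximation under stated assumptions: the input is taken to be a list of (word, tag) pairs with Penn-Treebank-style tags (`VBN` for past participles, tags beginning with `NN` for nouns), and "prenominal" is approximated as "immediately followed by a noun". The function name `split_participle_uses` and the toy sentence are hypothetical, and the search patterns actually used in the study are not specified at this level of detail.

```python
def split_participle_uses(tagged_tokens, verb_form):
    """Count past-participle uses of verb_form as (prenominal, other).

    tagged_tokens: list of (word, tag) pairs; tags are assumed to be
    Penn-Treebank-style (VBN = past participle, NN* = noun).
    """
    prenominal = 0
    other = 0
    for i, (word, tag) in enumerate(tagged_tokens):
        if word.lower() != verb_form or tag != "VBN":
            continue
        # Look at the tag of the next token, if there is one.
        next_tag = tagged_tokens[i + 1][1] if i + 1 < len(tagged_tokens) else ""
        if next_tag.startswith("NN"):
            prenominal += 1  # e.g. "the melted butter"
        else:
            other += 1       # e.g. "was melted by the cook"
    return prenominal, other


if __name__ == "__main__":
    # Toy example, not corpus data.
    sentence = [("the", "DT"), ("melted", "VBN"), ("butter", "NN"),
                ("was", "VBD"), ("melted", "VBN"), ("by", "IN"),
                ("the", "DT"), ("cook", "NN")]
    print(split_participle_uses(sentence, "melted"))
```

A heuristic of this kind over-generates (a participle followed by a noun is not always adjectival), which is one reason counts of this sort still need some inspection by hand.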
2.4 Results and Discussion 
The raw aggregated data in Table 2 show that properties related to
the main verb (MV) usage (intransitivity, active voice,
non-adjectival use and simple past use, as well as the MV
construction itself) were more frequent for unergatives than for
unaccusatives, and more frequent for unaccusatives than for
object-drop verbs. The numerical trend is in accord with the
simplest explanation of the use of frequency by humans: more
frequently occurring structures are preferred over less frequent
alternatives. However, not all numerical differences are
significant, as indicated in Table 3.
The data in Table 2 were entered in 10 different analyses of
variance on the proportion of cases that indicate a use of the verb
as a main verb and its related lexical and sublexical properties:
simple past, active, intransitive and non-adjectival use. Results of
the ANOVAs are shown in Table 3. The ANOVAs were run to determine if
verbs that belong to a class have a significantly different
distribution than verbs that belong to one of the other two classes.
We chose to perform analysis of variance because this test compares
variance within a group to variance between groups, thus it is not
distorted by the fact that there is great variation from lexical
item to lexical item within each group. A simplified summary of the
results and the corresponding human intuitive data is given in
Table 4.
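
To make the comparison of within-group and between-group variance concrete, here is a minimal sketch of a one-way ANOVA F statistic for a pairwise comparison of two verb classes. The per-verb proportions below are hypothetical placeholders (the paper reports only aggregated counts), and the function name `one_way_anova` is ours. With two groups of 10 verbs the within-group degrees of freedom are 18, and with two groups of 8 verbs they are 14, which is consistent with the F(18) and F(14) values in Table 3. Computing the p-value additionally requires the F distribution, which is omitted here.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual verbs around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical per-verb proportions of intransitive use (NOT from the paper).
    unergative = [0.95, 0.90, 0.97, 0.88, 0.93, 0.99, 0.91, 0.96, 0.85, 0.94]
    unaccusative = [0.40, 0.65, 0.30, 0.55, 0.70, 0.45, 0.60, 0.35, 0.50, 0.58]
    f_stat, df_b, df_w = one_way_anova([unergative, unaccusative])
    print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

Because the statistic is a ratio of between-group to within-group mean squares, a class whose verbs vary widely among themselves needs a correspondingly larger difference in class means to reach significance.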
All the data sets show the same pattern. For the lexical features
(simple past, active, intransitive and non-adjectival use) the
differences between the unergative and unaccusative distributions
for each property are highly significant (p < 0.05), but the
differences between the unaccusative and object-drop distributions
are not (p > 0.05). This could explain why the unergatives are
significantly more difficult in the RR, while the other classes of
verbs are not perceived as different.

VERB TYPE      JUDGMENT   SIGNIFICANCE TEST
                          MV/RR     MV/PRT    ACT/PASS   INTR/TRANS   ADJ/NON-ADJ
unergative     hard       non-sig   sig       sig        sig          sig
unaccusative   easy
unaccusative   easy       non-sig   non-sig   non-sig    non-sig      non-sig
object-drop    easy
Table 4: Processing difficulty of different classes of optionally
intransitive verbs according to speakers' intuitions compared to the
results of significance tests on pairwise comparisons of corpus data
Interestingly, a more direct count of the construction itself (the
MV/RR probability space) gives different results. Numerically, the
counts of RR for unaccusatives are very small, but native speakers
do not find RR with unaccusative verbs particularly difficult.
Statistically, unergatives are not significantly different from
unaccusatives (p = 0.059), but native speakers find RR with
unaccusative verbs considerably easier than with unergatives.
The picture that emerges from these findings is coherent and in
accordance with current developments in statistical parsing and
grammatical theory in two important respects. First, the discrepancy
between the frequencies of each of the lexical features and the
frequencies of the actual construction suggests that the frequency
of a construction is a composition function of (at least some of)
its lexical features, even if such features are not independent.
Models that can handle non-independent lexical features have given
very good results both for part-of-speech and structural
disambiguation (Ratnaparkhi, 1996; Ratnaparkhi, 1997; Ratnaparkhi,
1998).
Second, we observe that the lexical and sub- 
lexical features we counted are not sufficient to 
identify all the relevant linguistic classes: sta- 
tistical tests fail to differentiate between unac- 
cusatives and object-drop verbs. In order to 
distinguish between these two classes of verbs 
one needs to look at some of the surrounding 
context. This result is expected. Performance 
measures of statistical parsers show that statis- 
tics based on one word give poor results, but 
that statistics on bigrams have much better per- 
formance (Charniak, 1997). 
3 General Discussion 
3.1 Relationship between Different 
Kinds of Methods 
Our results cast some light on an important methodological question:
can frequencies in annotated corpora be considered a good
approximation of speakers' preferences? Recent results in the
literature have argued that they cannot, showing large discrepancies
between data collection methods (Merlo, 1994), comprehension and
production (Gibson et al., 1996), and on-line preferences and
corpora counts (Brysbaert et al., 1998). Several explanations have
been proposed, mostly dismissive of some particular method to
collect data, for example: frequency-based preferences are not used
by humans; the wrong frequencies had been counted; experimental
results are not representative of natural linguistic behaviour; or
corpora are not representative of natural linguistic behaviour. The
findings in this study show a way of reconciling results obtained by
different data collection methods: if we count at the level of
lexical and sublexical features, we find that differences in native
speakers' preferences do correspond to significant differences in
distributions. Similar conclusions are being reached in (Roland and
Jurafsky, 1998), who compare different corpora.
3.2 Classification Properties of Lexical 
Features and Consequences 
Looking at the frequencies of the lexical features in Table 2, we
can observe that PRT, PASS and TRANS have counts that can be used to
directly predict the difficulty of the RR construction. This
observation can be used beneficially in a task different from
parsing, for instance in a generation system. Some current methods
have a generate and filter approach (Knight and Hatzivassiloglou,
1995): all constructions are generated and then filtered based on a
statistical model. If the trigram model has a good fit with text,
our experiments indicate that it would eliminate many RRs for
unaccusatives that would be considered acceptable by speakers. If
instead the filtering is based on, for example, the frequency of the
past participle use, the system would correctly allow unaccusative
RRs, but filter out unergative RRs.
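
The contrast between the two filtering strategies can be illustrated with a small sketch over the aggregated counts of Table 2. Several things here are assumptions made purely for illustration: the thresholds (0.05 and 0.2) are arbitrary, the relative frequency of the RR construction is used as a crude stand-in for what a construction-level model such as a trigram model might capture, and the function names are hypothetical.

```python
# Counts copied from Table 2 (only the columns needed here).
COUNTS = {
    "unergative": {"RR": 7, "MV": 4910, "PRT": 647},
    "unaccusative": {"RR": 21, "MV": 3321, "PRT": 1476},
    "object-drop": {"RR": 202, "MV": 2316, "PRT": 1939},
}


def proportion(counts, event, alternative):
    """Relative frequency of event against its single alternative."""
    return counts[event] / (counts[event] + counts[alternative])


def allowed_classes(event, alternative, threshold):
    """Verb classes whose RRs survive a filter on the given feature."""
    return sorted(
        verb_class
        for verb_class, counts in COUNTS.items()
        if proportion(counts, event, alternative) >= threshold
    )


if __name__ == "__main__":
    # Filter on the construction itself (illustrative threshold).
    print("by RR frequency: ", allowed_classes("RR", "MV", 0.05))
    # Filter on the lexical feature past participle (illustrative threshold).
    print("by PRT frequency:", allowed_classes("PRT", "MV", 0.2))
```

Under these illustrative thresholds, the construction-level filter rejects unaccusative RRs along with unergative ones, whereas the filter on past participle frequency keeps unaccusative and object-drop RRs and rejects only the unergative ones, which is the pattern of speaker judgments in Table 1.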
Moreover, we notice that all the lexical features reproduce the
well-known relation between markedness within a language and
typology of languages: what is an existing but infrequent
construction in a few languages is absent in many languages. In this
instance, the transitive use of manner of motion verbs (The rider
raced the horse) is a marked construction in English, in the sense
that while it is grammatical, this use is restricted to a subset of
manner of motion verbs. This construction, which is marked in
English, is ungrammatical in Romance: languages such as Italian or
French do not have a grammatical direct translation for the sentence
above. This is called in the social sciences a zero-rare
distribution, where a feature that is generally already rare is
never present in one subclass of the cases.
Interestingly, the lexical feature ADJ presents a distribution that
reflects this cross-linguistic fact internally to English:
unergative verbs never occur prenominally, even those that can occur
transitively, passively and in reduced relative clauses.
This is a particularly useful distributional cue for verb
classification. On observing the complete absence of prenominal
adjectives derived from transitive verbs one can classify the verb
as unergative. Or, the cue provided can be used
in a translation task: one of the typical ar- 
gument structure divergences between English 
and Romance languages can be inferred by look- 
ing at distributional data. Thus, by observing 
the absence of prenominal adjectives in English 
the translation system can avoid proposing the 
RR alternative in the target language, where it 
would be ungrammatical. 
3.3 Language Engineering 
Finally, this kind of in-depth corpus analysis gives us indications
on what kind of syntactic annotation is needed in order to be able
to use a corpus to perform tasks at the sentence level, and also,
possibly, how to bootstrap a syntactic annotation process in a way
that does not require much in-depth semantic knowledge about words.
Had we wanted to perform the study reported in this paper by simple
counting of occurrences in an appropriately annotated text (thus
eliminating the need for the tedious and time-consuming filtering of
the automatic extraction which was necessary in the present study),
we would have needed a text annotated with categories derived from
knowledge about individual lexical items and a small portion of the
tree surrounding them. First, all our counts assumed knowledge of
the verb classification in unergative, unaccusative and object-drop,
which requires annotation of the thematic roles of the verb.
Furthermore, for the counts of the several variables described we
needed the verb items and the preceding auxiliary (active-passive
and MV/past participle), the following noun phrase and knowledge
about whether the noun phrase was the direct object of the verb or
not (transitive-intransitive), the preceding noun phrase and
knowledge about whether the noun phrase was the subject of the verb
or rather an adjunct head (MV/reduced relative), and the preceding
determiner (adjective-non-adjective). This is evidence in favour of
annotation using a lexicalised formalism, whose main units are
argument-structure dependencies between words, whether encoded
structurally, as in LTAG (Schabes and Joshi, 1991), or as
grammatical relations, as in dependency grammar (Hudson, 1990;
Mel'cuk, 1988). From the point of view of parsing, these counts
require only one chunk of text each.
As an example, consider a grammatical formalism, such as LTAG
(Schabes and Joshi, 1991), which is both lexicalised and has been
used to chunk text without performing a full parse. An LTAG lexicon
is a forest of lexicalised elementary trees. For verbs, the tree
structure corresponds to their argument structure. Thus, each of the
lexical items and portions of tree mentioned above corresponds to a
different elementary tree, including the unergative and unaccusative
distinction, encoded by different labels referring to thematic
roles. Current LTAG part-of-speech taggers, called supertaggers
(Joshi and Srinivas, 1994; Srinivas, 1997), assign a set of
elementary trees to each word, in effect chunking the text. The
counts performed in the study reported here would have required
simply counting the occurrences of the labels assigned to the words
in the text by such a supertagger. Refinements in this direction of
the annotation of the grammar used by the XTAG system (Doran et al.,
1994) are actually under way.
We also can see, from the raw frequencies ob- 
tained, that when collecting counts about syn- 
tactic phenomena, corpora must be in the order 
of hundreds of millions of words for the statistics 
to be reliable. 
4 Conclusions 
Our main result in this paper is that statistics
over lexical features best correspond to independently
established human intuitive judgments.
We have argued that, methodologically, this re- 
sult casts light on the relationship between dif- 
ferent data collection methods, and shows that 
some apparently contradictory results can be 
reconciled by defining probability spaces at the 
lexical and sublexical level. From the point of 
view of language engineering, we have argued 
that this result provides an indication of what 
units might reflect preferences that port across 
tasks, and what type of syntactic annotation of 
corpora is going to be most useful. 
5 Acknowledgments 
This research was partly sponsored by the Swiss 
National Science Foundation, under fellowship 
8210-46569 to P. Merlo, and by the US National 
Science Foundation, under grant #9702331 to 
S. Stevenson. We thank Aravind Joshi, Martha
Palmer and Adwait Ratnaparkhi for useful comments.

References 
Thomas G. Bever. 1970. The cognitive basis for linguistic structure.
  In J. R. Hayes, editor, Cognition and the Development of Language.
  John Wiley, New York.
M. Brysbaert, D.C. Mitchell, and Stefan Grondelaers. 1998.
  Cross-linguistic differences in modifier attachment biases:
  Evidence against Gricean and Tuning accounts. Manuscript.
Eugene Charniak. 1997. Statistical parsing with a context-free
  grammar and word statistics. In Proc. of the 14th National
  Conference on AI.
Michael John Collins. 1997. Three generative, lexicalised models for
  statistical parsing. In Proc. of the 35th Annual Meeting of the
  ACL, pages 16-23.
Christy Doran, Dania Egedi, Beth Ann Hockey, B. Srinivas, and Martin
  Zaidel. 1994. XTAG system - a wide coverage grammar for English.
  In Proceedings of the 15th International Conference on
  Computational Linguistics (COLING 94), pages 922-928, Kyoto,
  Japan.
E. Gibson, C. Schütze, and A. Salomon. 1996. The relationship
  between the frequency and the processing complexity of linguistic
  structure. J. of Psych. Research, 25(1):59-92.
Ken Hale and Jay Keyser. 1993. On argument structure and the lexical
  representation of syntactic relations. In K. Hale and J. Keyser,
  editors, The View from Building 20, pages 53-110. MIT Press.
Richard Hudson. 1990. English Word Grammar. Basil Blackwell.
A. Joshi and B. Srinivas. 1994. Disambiguation of super parts of
  speech (Supertags): Almost parsing. In Proc. of Coling 94, Kyoto,
  Japan.
Kevin Knight and Vasileios Hatzivassiloglou. 1995. Two-level,
  many-paths generation. In Proc. of the 33rd Annual Meeting of the
  ACL, pages 252-260, Cambridge, MA.
Beth Levin and Malka Rappaport Hovav. 1995. Unaccusativity. MIT
  Press, Cambridge, MA.
Beth Levin. 1993. English Verb Classes and Alternations. Chicago
  University Press, Chicago, IL.
Maryellen C. MacDonald, Neal J. Pearlmutter, and Mark Seidenberg.
  1994. The lexical nature of syntactic ambiguity resolution.
  Psychological Review.
Maryellen MacDonald. 1994. Probabilistic constraints and syntactic
  ambiguity resolution. Language and Cognitive Processes,
  9(2):157-201.
I. Mel'cuk. 1988. Dependency Syntax: Theory and practice. SUNY
  Press, Albany.
Paola Merlo. 1994. A corpus-based analysis of verb continuation
  frequencies for syntactic processing. Journal of Psycholinguistics
  Research, 23(6):435-457.
Adwait Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger.
  In Proceedings of the Empirical Methods in Natural Language
  Processing Conference, Philadelphia, PA. University of
  Pennsylvania.
Adwait Ratnaparkhi. 1997. A linear observed time statistical parser
  based on maximum entropy models. In 2nd Conf. on Empirical Methods
  in NLP, pages 1-10, Providence, RI.
Adwait Ratnaparkhi. 1998. Statistical models for unsupervised
  prepositional phrase attachment. In Proc. of the 36th Annual
  Meeting of the ACL, Montreal, CA.
Doug Roland and Dan Jurafsky. 1998. How verb subcategorization
  frequencies are affected by corpus choice. In Proc. of the 36th
  Annual Meeting of the ACL, Montreal, CA.
Yves Schabes and Aravind Joshi. 1991. Parsing with lexicalized tree
  adjoining grammars. In Masaru Tomita, editor, Current Issues in
  Parsing Technology. Kluwer Academic Publishers.
B. Srinivas. 1997. Complexity of Lexical Descriptions and its
  Relevance to Partial Parsing. Ph.D. thesis, University of
  Pennsylvania.
Suzanne Stevenson and Paola Merlo. 1997. Lexical structure and
  processing complexity. Language and Cognitive Processes.
John Trueswell, Michael Tanenhaus, and Susan Garnsey. 1994. Semantic
  influences on parsing: Use of thematic role information in
  syntactic ambiguity resolution. Journal of Memory and Language,
  33:285-318.
John Trueswell. 1996. The role of lexical frequency in syntactic
  ambiguity resolution. J. of Memory and Language, 35:566-585.
