Detecting Verbal Participation in Diathesis Alternations 
Diana McCarthy 
Cognitive & Computing Sciences, 
University of Sussex 
Brighton BN1 9QH, UK 
Anna Korhonen 
Computer Laboratory, 
University of Cambridge, Pembroke Street, 
Cambridge CB2 3QG, UK 
Abstract 
We present a method for automatically identi- 
fying verbal participation in diathesis alterna- 
tions. Automatically acquired subcategoriza- 
tion frames are compared to a hand-crafted clas- 
sification for selecting candidate verbs. The 
minimum description length principle is then 
used to produce a model and cost for storing the 
head noun instances from a training corpus at 
the relevant argument slots. Alternating sub- 
categorization frames are identified where the 
data from corresponding argument slots in the 
respective frames can be combined to produce 
a cheaper model than that produced if the data 
is encoded separately. I. 
1 Introduction 
Diathesis alternations are regular variations in 
the syntactic expressions of verbal arguments, 
for example The boy broke the window ~-. The 
window broke. Levin's (1993) investigation of 
alternations summarises the research done and 
demonstrates the utility of alternation informa- 
tion for classifying verbs. Some studies have re- 
cently recognised the potential for using diathe- 
sis alternations within automatic lexical acquisi- 
tion (Ribas, 1995; Korhonen, 1997; Briscoe and 
Carroll, 1997). 
This paper shows how corpus data can be 
used to automatically detect which verbs un- 
dergo these alternations. Automatic acquisi- 
tion avoids the costly overheads of a manual 
approach and allows for the fact that pred- 
icate behaviour varies between sublanguages, 
domains and across time. Subcategorization 
frames (SCFs) are acquired for each verb and 
1This work was partially funded by CEC LE1 project 
"SPARKLE". We also acknowledge support from UK 
EPSRC project "PSET: Practical Simplification of En- 
glish Text". 
a hand-crafted classification of diathesis alter- 
nations filters potential candidates with the 
correct SCFs. Models representing the selec- 
tional preferences of each verb for the argument 
slots under consideration are then used to indi- 
cate cases where the underlying arguments have 
switched position in alternating SCFs. The se- 
lectional preferences models are produced from 
argument head data stored specific to SCF and 
slot. 
The preference models are obtained using the 
minimum description length (MDL) principle. 
MDL selects an appropriate model by compar- 
ing potential candidates in terms of the cost of 
storing the model and the data stored using that 
model for each set of argument head data. We 
compare the cost of representing the data at al- 
ternating argument slots separately with that 
when the data is combined to indicate evidence 
for participation in an alternation. 
2 SCF Identification 
The SCFs applicable to each verb are extracted 
automatically from corpus data using the sys- 
tem of Briscoe and Carroll (1997). This compre- 
hensive verbal acquisition system distinguishes 
160 verbal SCFs. It produces a lexicon of verb 
entries each organised by SCF with argument 
head instances enumerated at each slot. 
The hand-crafted diathesis alternation clas- 
sification links Levin's (1993) index of alterna- 
tions with the 160 SCFs to indicate which classes 
are involved in alternations. 
3 Selectional Preference Acquisition 
Selectional preferences can be obtained for the 
subject, object and prepositional phrase slots 
for any specified SCF classes. The input data 
includes the target verb, SCF and slot along 
with the noun frequency data and any prepo- 
1493 
sition (for PPs). Selectional preferences are 
represented as Association Tree Cut Models 
(ATCMS) aS described by Abe and Li (1996). 
These are sets of classes which cut across the 
WordNet hypernym noun hierarchy (Miller et 
al., 1993) covering all leaves disjointly. Associ- 
ation scores, given by ~ are calculated for p(c) ' 
the classes. These scores are calculated from 
the frequency of nouns occurring with the tar- 
get verb and irrespective of the verb. The score 
indicates the degree of preference between the 
class (c) and the verb (v) at the specified slot. 
Part of the ATCM for the direct object slot of 
build is shown in Figure 1. For another verb a 
different level for the cut might be required. For 
example eat might require a cut at the FOOD 
hyponym of OBJECT. 
Finding the best set of classes is key to ob- 
taining a good preference model. Abe and Li 
use MDL to do this. MDL is a principle from in- 
formation theory (Rissanen, 1978) which states 
that the best model minimises the sum of i the 
number of bits to encode the model, and ii the 
number of bits to encode the data in the model. 
This makes the compromise between a simple 
model and one which describes the data effi- 
ciently. 
Abe and Li use a method of encoding tree cut 
models using estimated frequency and probabil- 
ity distributions for the data description length. 
The sample size and number of classes in the 
cut are used for the model description length. 
They provide a way of obtaining the ATCMS us- 
ing the identity p(clv ) = A(c, v) × p(c). Initially 
a tree cut model is obtained for the marginal 
probability p(c) for the target slot irrespective 
of the verb. This is then used with the condi- 
tional data and probability distribution p(clv ) 
to obtain an ATCM aS a by-product of obtaining 
the model for the conditional data. The actual 
comparison used to decide between two cuts is 
calculated as in equation 1 where C represents 
the set of classes on the cut model currently 
being examined and Sv represents the sample 
specific to the target verb. 2. 
IClloglSvl +  -freqc x log P(ClV) (1) 
2 o,c p(c) 
In determining the preferences the actual en- 
SAil logarithms are to the base 2 
\[~" } 
Figure 1: ATCM for build Object slot 
coding in bits is not required, only the relative 
cost of the cut models being considered. The 
WordNet hierarchy is searched top down to find 
the best set of classes under each node by locally 
comparing the description length at the node 
with the best found beneath. The final com- 
parison is done between a cut at the root and 
the best cut found beneath this. Where detail 
is warranted by the specificity of the data this 
is manifested in an appropriate level of general- 
isation. The description length of the resultant 
cut model is then used for detecting diathesis 
alternations. 
4 Evidence for Diathesis 
Alternations 
For verbs participating in an alternation one 
might expect that the data in the alternating 
slots of the respective SCFs might be rather ho- 
mogenous. This will depend on the extent to 
which the alternation applies to the predomi- 
nant sense of the verb and the majority of senses 
of the arguments. The hypothesis here is that 
if the alternation is reasonably productive and 
could occur for a substantial majority of the in- 
stances then the preferences at the correspond- 
ing slots should be similar. Moreover we hy- 
pothesis that if the data at the alternating slots 
is combined then the cost of encoding this data 
in one ATCM will be less than the cost of encod- 
ing the data in separate models, for the respec- 
tive slot and SCF. 
Taking the causative-inchoative alternation 
as an example, the object of the transitive frame 
switches to the subject of the intransitive frame: 
The boy broke the window ~ The window broke. 
Our strategy is to find the cost of encoding the 
data from both slots in separate ATCMS and 
compare it to the cost of encoding the combined 
data. Thus the cost of an ATCM for / the sub- 
1494 
Table 1: Causative-Inchoative Evaluation 
verbs 
true positives begin end Change 
swing 
false positives cut 
true negatives choose like help 
charge expect add 
feel believe ask 
false negatives 
total 
move 
9 
1 
I115 
ject of the intransitive and ii the object of the 
transitive should exceed the cost of an ATCM for 
the combined data only for verbs to which the 
alternation applies. 
5 Experimental Results 
A subcategorization lexicon was produced from 
10.8 million words of parsed text from the 
British National Corpus. In this preliminary 
work a small sample of 30 verbs were examined. 
These were selected for the range of SCFs that 
they exhibit. The primary alternation selected 
was the causative-inchoative because a reason- 
able number of these verbs (15) take both sub- 
categorization frames involved. ATCM models 
were obtained for the data at the subject of the 
intransitive frame and object of the transitive. 
The cost of these models was then compared to 
the cost of the model produced when the two 
data sets were combined. 
Table 1 shows the results for the 15 verbs 
which took both the necessary frames. The sys- 
tem's decision as to whether the verb partici- 
pates in the alternation or not was compared 
to the verdict of a human judge. The accuracy 
was 87%( 4+9 ~ Random choice would give ~, 4+1+9+1/" 
a baseline of 50%. The cause for the one false 
positive cut was that cut takes the middle alter- 
nation (The butcher cuts the meat ~-~ the meat 
cuts easily). This alternation cannot be distin- 
guished from the causative-inchoative because 
the scF acquisition system drops the adverbial 
and provides the intransitive classification. 
Performance on the simple reciprocal in- 
transitive alternation (John agreed with Mary 
Mary and John agreed) was less satisfac- 
tory. Three potential candidates were selected 
by virtue of their SCFs swing;with add;to and 
agree;with. None of these were identified as tak- 
ing the alternation which gave rise to 2 true neg- 
atives and I false negative. From examining the 
results it seems that many of the senses found at 
the intransitive slot of agree e.g. policy would 
not be capable of alternating. It is at least en- 
couraging that the difference in the cost of the 
separate and combined models was low. 
6 Conclusions 
Using MDL to detect alternations seems to be 
a useful strategy in cases where the majority of 
senses in alternating slot position do indeed per- 
mit the alternation. In other cases the method 
is at least conservative. Further work will ex- 
tend the results to include a wider range of al- 
ternations and verbs. We also plan to use this 
method to investigate the degree of compression 
that the respective alternations can make to the 
lexicon as a whole. 

References 
Naoki Abe and Hang Li. 1996. Learning word 
association norms using tree cut pair models. 
In Proceedings of the 13th International Con- 
ference on Machine Learning ICML, pages 3- 
11. 
Ted Briscoe and John Carroll. 1997. Automatic 
extraction of subcategorization from corpora. 
In Fifth Applied Natural Language Processing 
Conference., pages 356-363. 
Anna Korhonen. 1997. Acquiring subcategori- 
sation from textual corpora. Master's thesis, 
University of Cambridge. 
Beth Levin. 1993. English Verb Classes and Al- 
ternations: a preliminary investigation. Uni- 
versity of Chicago Press, Chicago and Lon- 
don. 
George Miller, Richard Beckwith, Christine 
Felbaum, David Gross, and Katherine 
Miller, 1993. Introduction to Word- 
Net: An On-Line Lezical Database. 
ftp//darity.princeton.edu/pub/WordNet / 
5papers.ps. 
Francesc Ribas. 1995. On Acquiring Appropri- 
ate Selectional Restrictions from Corpora Us- 
ing a Semantic Taxonomy. Ph.D. thesis, Uni- 
versity of Catalonia. 
J. Rissanen. 1978. Modeling by shortest data 
description. Automatica, 14:465-471. 
