Inducing a Semantically Annotated Lexicon 
via EM-Based Clustering 
Mats Rooth 
Stefan Riezler 
Detlef Prescher 
Glenn Carroll 
Franz Beil 
Institut für Maschinelle Sprachverarbeitung 
University of Stuttgart, Germany 
Abstract 
We present a technique for automatic induction 
of slot annotations for subcategorization frames, 
based on induction of hidden classes in the EM 
framework of statistical estimation. The models 
are empirically evaluated by a general decision 
test. Induction of slot labeling for subcategoriza- 
tion frames is accomplished by a further applica- 
tion of EM, and applied experimentally on frame 
observations derived from parsing large corpora. 
We outline an interpretation of the learned rep- 
resentations as theoretical-linguistic decomposi- 
tional lexical entries. 
1 Introduction 
An important challenge in computational lin- 
guistics concerns the construction of large-scale 
computational lexicons for the numerous natu- 
ral languages where very large samples of lan- 
guage use are now available. Resnik (1993) ini- 
tiated research into the automatic acquisition 
of semantic selectional restrictions. Ribas (1994) 
presented an approach which takes into account 
the syntactic position of the elements whose se- 
mantic relation is to be acquired. However, those 
and most of the following approaches require as 
a prerequisite a fixed taxonomy of semantic rela- 
tions. This is a problem because (i) entailment 
hierarchies are presently available for few lan- 
guages, and (ii) we regard it as an open ques- 
tion whether and to what degree existing designs 
for lexical hierarchies are appropriate for repre- 
senting lexical meaning. Both of these consid- 
erations suggest the relevance of inductive and 
experimental approaches to the construction of 
lexicons with semantic information. 
This paper presents a method for automatic 
induction of semantically annotated subcatego- 
rization frames from unannotated corpora. We 
use a statistical subcat-induction system which 
estimates probability distributions and corpus 
frequencies for pairs of a head and a subcat 
frame (Carroll and Rooth, 1998). The statistical 
parser can also collect frequencies for the nomi- 
nal fillers of slots in a subcat frame. The induc- 
tion of labels for slots in a frame is based upon 
estimation of a probability distribution over tu- 
ples consisting of a class label, a selecting head, 
a grammatical relation, and a filler head. The 
class label is treated as hidden data in the EM- 
framework for statistical estimation. 
2 EM-Based Clustering 
In our clustering approach, classes are derived 
directly from distributional data--a sample of 
pairs of verbs and nouns, gathered by pars- 
ing an unannotated corpus and extracting the 
fillers of grammatical relations. Semantic classes 
corresponding to such pairs are viewed as hid- 
den variables or unobserved data in the context 
of maximum likelihood estimation from incom- 
plete data via the EM algorithm. This approach 
allows us to work in a mathematically well- 
defined framework of statistical inference, i.e., 
standard monotonicity and convergence results 
for the EM algorithm extend to our method. 
The two main tasks of EM-based clustering are 
i) the induction of a smooth probability model 
on the data, and ii) the automatic discovery of 
class-structure in the data. Both of these aspects 
are respected in our application of lexicon in- 
duction. The basic ideas of our EM-based clus- 
tering approach were presented in Rooth (Ms). 
Our approach contrasts with the merely heuris- 
tic and empirical justification of similarity-based 
approaches to clustering (Dagan et al., to ap- 
pear) for which so far no clear probabilistic 
interpretation has been given. The probability 
model we use can be found earlier in Pereira 
et al. (1993). However, in contrast to this ap- 
[Figure 1: Class 17: scalar change. PROB 0.0265; the most probable verbs include increase.as:s, increase.aso:o, fall.as:s, pay.aso:o, reduce.aso:o, rise.as:s, exceed.aso:o, exceed.aso:s, affect.aso:o, grow.as:s, include.aso:s, reach.aso:s, decline.as:s, lose.aso:o, act.aso:s, improve.aso:o, include.aso:o, cut.aso:o, show.aso:o, vary.as:s.]
proach, our statistical inference method for clus- 
tering is formalized clearly as an EM-algorithm. 
Approaches to probabilistic clustering similar to 
ours were presented recently in Saul and Pereira 
(1997) and Hofmann and Puzicha (1998). There, EM algorithms for similar probability models have been derived, but they were applied only to simpler tasks not involving a combination of EM-based clustering models as in our lexicon induction experiment. For further applications of our
clustering model see Rooth et al. (1998). 
We seek to derive a joint distribution of verb-noun pairs from a large sample of pairs of verbs v ∈ V and nouns n ∈ N. The key idea is to view v and n as conditioned on a hidden class c ∈ C, where the classes are given no prior interpretation. The semantically smoothed probability of a pair (v, n) is defined to be:

p(v, n) = \sum_{c \in C} p(c, v, n) = \sum_{c \in C} p(c)\, p(v|c)\, p(n|c)

The joint distribution p(c, v, n) is defined by p(c, v, n) = p(c)\, p(v|c)\, p(n|c). Note that by construction, conditioning of v and n on each other is made solely through the classes c.
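For concreteness, the following is a minimal sketch of this smoothed pair probability. The dictionary-based parameter representation (p_c, p_v_given_c, p_n_given_c) is our own illustrative assumption, not part of the original implementation.

```python
def p_pair(v, n, p_c, p_v_given_c, p_n_given_c):
    """Semantically smoothed pair probability
    p(v, n) = sum over c of p(c) * p(v|c) * p(n|c).

    p_c:         dict class -> p(c)
    p_v_given_c: dict class -> (dict verb -> p(v|c))
    p_n_given_c: dict class -> (dict noun -> p(n|c))
    """
    return sum(p_c[c] * p_v_given_c[c].get(v, 0.0) * p_n_given_c[c].get(n, 0.0)
               for c in p_c)
```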
In the framework of the EM algorithm (Dempster et al., 1977), we can formalize clustering as an estimation problem for a latent class (LC) model as follows. We are given: (i) a sample space Y of observed, incomplete data, corresponding to pairs from V × N, (ii) a sample space X of unobserved, complete data, corresponding to triples from C × V × N, (iii) a set X(y) = {x ∈ X | x = (c, y), c ∈ C} of complete data related to the observation y, (iv) a complete-data specification p_θ(x), corresponding to the joint probability p(c, v, n) over C × V × N, with parameter vector θ = (θ_c, θ_vc, θ_nc | c ∈ C, v ∈ V, n ∈ N), (v) an incomplete-data specification p_θ(y), which is related to the complete-data specification as the marginal probability

p_\theta(y) = \sum_{x \in X(y)} p_\theta(x)

The EM algorithm is directed at finding a value θ̂ of θ that maximizes the incomplete-data log-likelihood function L as a function of θ for a given sample Y, i.e.,

\hat{\theta} = \arg\max_{\theta} L(\theta), \quad \text{where} \quad L(\theta) = \ln \prod_{y} p_\theta(y)
As prescribed by the EM algorithm, the parameters of L(θ) are estimated indirectly by proceeding iteratively in terms of complete-data estimation for the auxiliary function Q(θ; θ^(t)), which is the conditional expectation of the complete-data log-likelihood ln p_θ(x) given the observed data y and the current fit of the parameter values θ^(t) (E-step). This auxiliary function is iteratively maximized as a function of θ (M-step), where each iteration is defined by the map

\theta^{(t+1)} = M(\theta^{(t)}) = \arg\max_{\theta} Q(\theta; \theta^{(t)})

Note that our application is an instance of the EM algorithm for context-free models (Baum et al., 1970), from which the following particularly simple re-estimation formulae can be derived.
[Figure 2: Class 5: communicative action. PROB 0.0412; the most probable verbs include ask.as:s, think.as:s, shake.aso:s, smile.as:s, reply.as:s, shrug, wonder.as:s, feel.aso:s, take.aso:s, watch.aso:s, ask.aso:s, tell.aso:s, look.as:s, give.aso:s, hear.aso:s, grin.as:s, answer.as:s.]
Let x = (c, y) for fixed c and y. Then 
M(\theta_{vc}) = \frac{\sum_{y \in \{v\} \times N} p_\theta(x|y)}{\sum_{y} p_\theta(x|y)}, \qquad
M(\theta_{nc}) = \frac{\sum_{y \in V \times \{n\}} p_\theta(x|y)}{\sum_{y} p_\theta(x|y)}, \qquad
M(\theta_{c}) = \frac{\sum_{y} p_\theta(x|y)}{|Y|}
Intuitively, the conditional expectation of the 
number of times a particular v, n, or c choice 
is made during the derivation is prorated by the 
conditionally expected total number of times a 
choice of the same kind is made. As shown by 
Baum et al. (1970), these expectations can be 
calculated efficiently using dynamic program- 
ming techniques. Every such maximization step 
increases the log-likelihood function L, and a se- 
quence of re-estimates eventually converges to a 
(local) maximum of L. 
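A sketch of one such re-estimation step for the latent-class model, under the same hypothetical parameter representation as above; it assumes every observed pair receives positive probability under the current parameters (guaranteed by a random initialization with full support):

```python
from collections import defaultdict

def em_step(sample, p_c, p_vc, p_nc):
    """One EM iteration for the latent-class model.

    sample: list of observed (v, n) tokens (repeats count).
    E-step: posterior p(c | v, n) proportional to p(c) p(v|c) p(n|c).
    M-step: the re-estimation formulae M(theta) given above.
    """
    c_cnt = defaultdict(float)                        # sums of p(c|y) over all y
    vc_cnt = defaultdict(lambda: defaultdict(float))  # sums over y in {v} x N
    nc_cnt = defaultdict(lambda: defaultdict(float))  # sums over y in V x {n}
    for v, n in sample:
        post = {c: p_c[c] * p_vc[c].get(v, 0.0) * p_nc[c].get(n, 0.0)
                for c in p_c}
        z = sum(post.values())                        # assumed > 0, see above
        for c, p in post.items():
            w = p / z                                 # p(c | v, n)
            c_cnt[c] += w
            vc_cnt[c][v] += w
            nc_cnt[c][n] += w
    m = float(len(sample))                            # |Y|
    new_p_c = {c: c_cnt[c] / m for c in p_c}
    new_p_vc = {c: {v: w / c_cnt[c] for v, w in vc_cnt[c].items()} for c in p_c}
    new_p_nc = {c: {n: w / c_cnt[c] for n, w in nc_cnt[c].items()} for c in p_c}
    return new_p_c, new_p_vc, new_p_nc
```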
In the following, we will present some examples of induced clusters. Input to the clustering algorithm was a training corpus of 1280715 tokens (608850 types) of verb-noun pairs participating in the grammatical relations of intransitive and transitive verbs and their subject and object fillers. The data were gathered from the maximal-probability parses that the head-lexicalized probabilistic context-free grammar of Carroll and Rooth (1998) gave for the British National Corpus (117 million words).
[Figure 3: Evaluation of pseudo-disambiguation: accuracy plotted against the number of classes.]
Fig. 2 shows an induced semantic class out of a model with 35 classes. At the top are listed the 20 most probable nouns in the p(n|5) distribution and their probabilities, and at left are the 30 most probable verbs in the p(v|5) distribution. 5 is the class index. Those verb-noun pairs which were seen in the training data appear with a dot in the class matrix. Verbs with suffix .as:s indicate the subject slot of an active intransitive. Similarly, .aso:s denotes the subject slot of an active transitive, and .aso:o denotes the object slot of an active transitive. Thus v in the above discussion actually consists of a combination of a verb with a subcat frame slot as:s, aso:s, or aso:o. Induced classes often have a basis in lexical semantics; class 5 can be interpreted
as clustering agents, denoted by proper names, 
"man", and "woman", together with verbs denot- 
ing communicative action. Fig. 1 shows a clus- 
ter involving verbs of scalar change and things 
which can move along scales. Fig. 5 can be in- 
terpreted as involving different dispositions and 
modes of their execution. 
3 Evaluation of Clustering Models 
3.1 Pseudo-Disambiguation 
We evaluated our clustering models on a pseudo- 
disambiguation task similar to that performed 
in Pereira et al. (1993), but differing in detail. 
The task is to judge which of two verbs v and v′ is more likely to take a given noun n as its argument, where the pair (v, n) has been cut out of the original corpus and the pair (v′, n) is constructed by pairing n with a randomly chosen verb v′ such that the combination (v′, n) is completely unseen. Thus this test evaluates how well the models generalize over unseen verbs.
The data for this test were built as follows. We constructed an evaluation corpus of (v, n, v′) triples by randomly cutting a test corpus of 3000 (v, n) pairs out of the original corpus of 1280712 tokens, leaving a training corpus of 1178698 tokens. Each noun n in the test corpus was combined with a verb v′ which was randomly chosen according to its frequency such that the pair (v′, n) appeared neither in the training nor in the test corpus. However, the elements v, v′, and n were required to be part of the training corpus. Furthermore, we restricted the verbs and nouns in the evaluation corpus to the ones which occurred at least 30 times and at most 3000 times with some verb-functor v in the training corpus. The resulting 1337 evaluation triples were used to evaluate a sequence of clustering models trained from the training corpus.
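A sketch of this triple construction, under our reading of the procedure (the 30/3000 frequency restriction is omitted, and the bounded rejection-sampling loop is our own simplification):

```python
import random

def make_triples(test_pairs, train_pairs, seed=0):
    """Build (v, n, v') evaluation triples.  v' is drawn with probability
    proportional to its training frequency, subject to the constraint that
    (v', n) occurs neither in the training nor in the test data."""
    rng = random.Random(seed)
    seen = set(train_pairs) | set(test_pairs)
    verbs = [v for v, _ in train_pairs]     # token list = frequency weighting
    triples = []
    for v, n in test_pairs:
        for _ in range(1000):               # bounded rejection sampling
            v_alt = rng.choice(verbs)
            if (v_alt, n) not in seen:
                triples.append((v, n, v_alt))
                break
    return triples
```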
The clustering models we evaluated were parametrized in starting values of the training algorithm, in the number of classes of the model, and in the number of iteration steps, resulting in a sequence of 3 × 10 × 6 models. Starting from a lower bound of 50% random choice, accuracy was calculated as the number of times the model decided for p(n|v) > p(n|v′) out of all choices made. Fig. 3 shows the evaluation results for models trained with 50 iterations, averaged over starting values, and plotted against class cardinality. Different starting values had an effect of ±2% on the performance of the test.
We obtained a value of about 80 % accuracy for 
models between 25 and 100 classes. Models with 
more than 100 classes show a small but stable 
overfitting effect. 
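The decision rule can be sketched as follows, with p(n|v) computed from the class-smoothed joint probability as p(v, n)/p(v); the parameter dictionaries are the same hypothetical representation used in the sketches above:

```python
def pseudo_accuracy(triples, p_c, p_vc, p_nc):
    """Fraction of (v, n, v') triples on which the model decides
    p(n|v) > p(n|v'), with p(n|v) = p(v, n) / p(v)."""
    def joint(v, n):    # p(v, n) = sum_c p(c) p(v|c) p(n|c)
        return sum(p_c[c] * p_vc[c].get(v, 0.0) * p_nc[c].get(n, 0.0)
                   for c in p_c)
    def marginal(v):    # p(v) = sum_c p(c) p(v|c); positive for training verbs
        return sum(p_c[c] * p_vc[c].get(v, 0.0) for c in p_c)
    correct = sum(joint(v, n) / marginal(v) > joint(w, n) / marginal(w)
                  for v, n, w in triples)
    return correct / len(triples)
```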
3.2 Smoothing Power 
A second experiment addressed the smoothing power of the model by counting the number of (v, n) pairs in the set V × N of all possible combinations of verbs and nouns which received a positive joint probability by the model. The V × N space for the above clustering models included about 425 million (v, n) combinations; we approximated the smoothing size of a model by randomly sampling 1000 pairs from V × N and returning the percentage of positively assigned pairs in the random sample. Fig. 4 plots the smoothing results for the above models against the number of classes. Starting values had an influence of ±1% on performance. Given the proportion of the number of types in the training corpus to the V × N space, without clustering we have a smoothing power of 0.14%, whereas for example a model with 50 classes and 50 iterations has a smoothing power of about 93%.

[Figure 4: Evaluation on smoothing task: smoothing power plotted against the number of classes.]
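This measurement amounts to simple Monte Carlo estimation; a sketch, with function names of our own choosing:

```python
import random

def smoothing_power(verbs, nouns, p_pair, k=1000, seed=0):
    """Percentage of k randomly sampled (v, n) combinations from V x N
    that receive positive joint probability under the model."""
    rng = random.Random(seed)
    hits = sum(p_pair(rng.choice(verbs), rng.choice(nouns)) > 0
               for _ in range(k))
    return 100.0 * hits / k
```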
Corresponding to the maximum likelihood 
paradigm, the number of training iterations had 
a decreasing effect on the smoothing perfor- 
mance whereas the accuracy of the pseudo- 
disambiguation was increasing in the number of 
iterations. We found 50 iterations to be a good compromise in this trade-off.
4 Lexicon Induction Based on 
Latent Classes 
The goal of the following experiment was to de- 
rive a lexicon of several hundred intransitive and 
transitive verbs with subcat slots labeled with 
latent classes. 
4.1 Probabilistic Labeling with Latent 
Classes using EM-estimation 
To induce latent classes for the subject slot of a fixed intransitive verb, the following statistical inference step was performed. Given a latent class model p_LC(·) for verb-noun pairs, and a sample n_1, ..., n_M of subjects for a fixed intransitive verb, we calculate the probability of an arbitrary subject n ∈ N by:

p(n) = \sum_{c \in C} p(c, n) = \sum_{c \in C} p(c)\, p_{LC}(n|c)
The estimation of the parameter vector θ = (θ_c | c ∈ C) can be formalized in the EM framework by viewing p(n) or p(c, n) as a function of θ for fixed p_LC(·). The re-estimation formula resulting from the incomplete-data estimation for these probability functions has the following form (f(n) is the frequency of n in the sample of subjects of the fixed verb):

M(\theta_c) = \frac{\sum_{n \in N} f(n)\, p_\theta(c|n)}{\sum_{n \in N} f(n)}
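A sketch of this inference step under the hypothetical parameter representation used above: p_LC(n|c) stays fixed, and only the slot's class distribution p(c) is re-estimated.

```python
def label_intransitive(subject_freqs, p_c_init, p_n_given_c, iters=50):
    """Re-estimate the class distribution p(c) for the subject slot of one
    intransitive verb, holding the latent-class model p_LC(n|c) fixed.
    subject_freqs maps noun -> corpus frequency f(n)."""
    p_c = dict(p_c_init)
    total = float(sum(subject_freqs.values()))
    for _ in range(iters):
        counts = {c: 0.0 for c in p_c}
        for n, f in subject_freqs.items():
            post = {c: p_c[c] * p_n_given_c[c].get(n, 0.0) for c in p_c}
            z = sum(post.values())             # assumed > 0 for attested n
            for c in p_c:
                counts[c] += f * post[c] / z   # f(n) * p_theta(c | n)
        p_c = {c: counts[c] / total for c in p_c}
    return p_c
```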
A similar EM induction process can also be applied to pairs of nouns, thus enabling induction of latent semantic annotations for transitive verb frames. Given an LC model p_LC(·) for verb-noun pairs, and a sample (n_1, n_2)_1, ..., (n_1, n_2)_M of noun arguments (n_1 subjects, and n_2 direct objects) for a fixed transitive verb, we calculate the probability of its noun argument pairs by:

p(n_1, n_2) = \sum_{c_1, c_2 \in C} p(c_1, c_2, n_1, n_2) = \sum_{c_1, c_2 \in C} p(c_1, c_2)\, p_{LC}(n_1|c_1)\, p_{LC}(n_2|c_2)
Again, estimation of the parameter vector θ = (θ_{c_1 c_2} | c_1, c_2 ∈ C) can be formalized in an EM framework by viewing p(n_1, n_2) or p(c_1, c_2, n_1, n_2) as a function of θ for fixed p_LC(·). The re-estimation formula resulting from this incomplete-data estimation problem has the following simple form (f(n_1, n_2) is the frequency of (n_1, n_2) in the sample of noun argument pairs of the fixed verb):

M(\theta_{c_1 c_2}) = \frac{\sum_{n_1, n_2 \in N} f(n_1, n_2)\, p_\theta(c_1, c_2 | n_1, n_2)}{\sum_{n_1, n_2 \in N} f(n_1, n_2)}
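The transitive analogue, again as a sketch: the parameters are now a joint distribution over class pairs, re-estimated with the formula above while p_LC(n|c) remains fixed (the uniform start is our own assumption).

```python
from itertools import product

def label_transitive(pair_freqs, classes, p_n_given_c, iters=50):
    """Re-estimate the joint class distribution p(c1, c2) for the subject
    and object slots of one transitive verb, holding p_LC(n|c) fixed.
    pair_freqs maps (n1, n2) -> corpus frequency f(n1, n2)."""
    class_pairs = list(product(classes, repeat=2))
    p_cc = {cc: 1.0 / len(class_pairs) for cc in class_pairs}  # uniform start
    total = float(sum(pair_freqs.values()))
    for _ in range(iters):
        counts = {cc: 0.0 for cc in class_pairs}
        for (n1, n2), f in pair_freqs.items():
            post = {(c1, c2): p_cc[(c1, c2)]
                              * p_n_given_c[c1].get(n1, 0.0)
                              * p_n_given_c[c2].get(n2, 0.0)
                    for c1, c2 in class_pairs}
            z = sum(post.values())
            for cc in class_pairs:
                counts[cc] += f * post[cc] / z  # f * p_theta(c1, c2 | n1, n2)
        p_cc = {cc: counts[cc] / total for cc in class_pairs}
    return p_cc
```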
Note that the class distributions p(c) and p(c_1, c_2) for intransitive and transitive models can also be computed for verbs unseen in the LC model.
blush 5 0.982975: constance 3, christina 3, willie 2.99737, ronni 2, claudia 2, gabriel 2, maggie 2, bathsheba 2, sarah 2, girl 1.9977
snarl 5 0.962094: mandeville 2, jinkwa 2, man 1.99859, scott 1.99761, omalley 1.99755, shamlou 1, angalo 1, corbett 1, southgate 1, ace 1

Figure 6: Lexicon entries: blush, snarl
increase 17 0.923698: number 134.147, demand 30.7322, pressure 30.5844, temperature 25.9691, cost 23.9431, proportion 23.8699, size 22.8108, rate 20.9593, level 20.7651, price 17.9996

Figure 7: Scalar motion increase.
4.2 Lexicon Induction Experiment 
Experiments used a model with 35 classes. From maximal-probability parses for the British National Corpus derived with a statistical parser (Carroll and Rooth, 1998), we extracted frequency tables for intransitive verb/subject pairs and transitive verb/subject/object triples. The 500 most frequent verbs were selected for slot labeling. Fig. 6 shows two verbs v for which the most probable class label is 5, a class which we earlier described as communicative action, together with the estimated frequencies of f(n) p_θ(c|n) for those ten nouns n for which this estimated frequency is highest.
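Reading off such a lexicon entry amounts to picking the most probable label for the slot and ranking fillers by estimated frequency; a sketch with our hypothetical names:

```python
def top_fillers(filler_freqs, p_c, p_n_given_c, k=10):
    """Return the most probable class label for a slot together with the k
    nouns ranked by estimated frequency f(n) * p_theta(c | n)."""
    best = max(p_c, key=p_c.get)
    def posterior(n):                      # p_theta(best | n)
        z = sum(p_c[c] * p_n_given_c[c].get(n, 0.0) for c in p_c)
        return p_c[best] * p_n_given_c[best].get(n, 0.0) / z
    ranked = sorted(((f * posterior(n), n) for n, f in filler_freqs.items()),
                    reverse=True)
    return best, p_c[best], ranked[:k]
```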
Fig. 7 shows corresponding data for an intran- 
sitive scalar motion sense of increase. 
Fig. 8 shows the intransitive verbs which take 
17 as the most probable label. Intuitively, the 
verbs are semantically coherent. When com- 
pared to Levin (1993)'s 48 top-level verb classes, 
we found an agreement of our classification with 
her class of "verbs of changes of state" except for 
the last three verbs in the list in Fig. 8 which is 
sorted by probability of the class label. 
Similar results for German intransitive scalar 
motion verbs are shown in Fig. 9. The data 
for these experiments were extracted from the 
maximal-probability parses of a 4.1 million word 
corpus of German subordinate clauses, yielding 418290 tokens (318086 types) of pairs of verbs or adjectives and nouns. The lexicalized probabilistic grammar for German used is described in Beil et al. (1999). We compared the German example of scalar motion verbs to the linguistic classification of verbs given by Schuhmacher (1986) and found an agreement of our classification with the class of "einfache Änderungsverben" (simple verbs of change) except for the verbs anwachsen (increase) and stagnieren (stagnate), which were not classified there at all.

[Figure 5: Class 8: dispositions. PROB 0.0369; the most probable verbs include require.aso:o, show.aso:o, need.aso:o, involve.aso:o, produce.aso:o, occur.as:s, cause.aso:s, cause.aso:o, affect.aso:s, require.aso:s, mean.aso:o, suggest.aso:o, produce.aso:s, demand.aso:o, reduce.aso:s, reflect.aso:o, involve.aso:s, undergo.aso:o.]

decrease 0.977992, double 0.948099, increase 0.923698, decline 0.908378, rise 0.877338, soar 0.876083, fall 0.803479, slow 0.672409, diminish 0.583314, drop 0.560727, grow 0.476524, vary 0.42842, improve 0.365586, climb 0.365374, flow 0.292716, cut 0.280183, mount 0.238182

Figure 8: Scalar motion verbs

ansteigen (go up) 0.741467, steigen (rise) 0.720221, absinken (sink) 0.693922, sinken (go down) 0.656021, schrumpfen (shrink) 0.438486, zurückgehen (decrease) 0.375039, anwachsen (increase) 0.316081, stagnieren (stagnate) 0.215156, wachsen (grow) 0.160317, hinzukommen (be added) 0.154633

Figure 9: German intransitive scalar motion verbs
Fig. 10 shows the most probable pair of classes 
for increase as a transitive verb, together with 
estimated frequencies for the head filler pair. 
Note that the object label 17 is the class found 
with intransitive scalar motion verbs; this cor- 
respondence is exploited in the next section. 
increase (8, 17) 0.3097650: development - pressure 2.3055, fat - risk 2.11807, communication - awareness 2.04227, supplementation - concentration 1.98918, increase - number 1.80559

Figure 10: Transitive increase with estimated frequencies for filler pairs.
5 Linguistic Interpretation 
In some linguistic accounts, multi-place verbs 
are decomposed into representations involv- 
ing (at least) one predicate or relation 
per argument. For instance, the transitive 
causative/inchoative verb increase is composed 
of an actor/causative verb combining with a 
one-place predicate in the structure on the left in Fig. 11.

[Figure 11: First tree: linguistic lexical entry for transitive verb increase. Second: corresponding lexical entry with induced classes as relational constants. Third: indexed open class root added as conjunct in transitive scalar motion increase. Fourth: induced entry for related intransitive increase.]

Linguistically, such representations are
motivated by argument alternations (diathesis), 
case linking and deep word order, language acquisition, scope ambiguity, by the desire to represent aspects of lexical meaning, and by the fact
that in some languages, the postulated decom- 
posed representations are overt, with each primi- 
tive predicate corresponding to a morpheme. For 
references and recent discussion of this kind of 
theory see Hale and Keyser (1993) and Kural 
(1996). 
We will sketch an understanding of the lexi- 
cal representations induced by latent-class label- 
ing in terms of the linguistic theories mentioned 
above, aiming at an interpretation which com- 
bines computational learnability, linguistic mo- 
tivation, and denotational-semantic adequacy. 
The basic idea is that latent classes are compu- 
tational models of the atomic relation symbols 
occurring in lexical-semantic representations. As 
a first implementation, consider replacing the re- 
lation symbols in the first tree in Fig. 11 with 
relation symbols derived from the latent class la- 
beling. In the second tree in Fig. 11, R17 and R8 
are relation symbols with indices derived from 
the labeling procedure of Sect. 4. Such represen- 
tations can be semantically interpreted in stan- 
dard ways, for instance by interpreting relation 
symbols as denoting relations between events 
and individuals. 
Such representations are semantically inad- 
equate for reasons given in philosophical cri- 
tiques of decomposed linguistic representations; 
see Fodor (1998) for recent discussion. A lex- 
icon estimated in the above way has as many 
primitive relations as there are latent classes. We 
guess there should be a few hundred classes in an 
approximately complete lexicon (which would 
have to be estimated from a corpus of hun- 
dreds of millions of words or more). Fodor's ar- 
guments, which are based on the very limited de- 
gree of genuine interdefinability of lexical items 
and on Putnam's arguments for contextual de- 
termination of lexical meaning, indicate that the 
number of basic concepts has the order of mag- 
nitude of the lexicon itself. More concretely, a 
lexicon constructed along the above principles 
would identify verbs which are labelled with the 
same latent classes; for instance it might identify 
the representations of grab and touch. 
For these reasons, a semantically adequate 
lexicon must include additional relational con- 
stants. We meet this requirement in a simple 
way, by including as a conjunct a unique con- 
stant derived from the open-class root, as in 
the third tree in Fig. 11. We introduce index- 
ing of the open class root (copied from the class 
index) in order that homophony of open class 
roots not result in common conjuncts in seman- 
tic representations--for instance, we don't want 
the two senses of decline exemplified in decline 
the proposal and decline five percent to have a 
common entailment represented by a common 
conjunct. This indexing method works as long 
as the labeling process produces different latent 
class labels for the different senses. 
The last tree in Fig. 11 is the learned represen- 
tation for the scalar motion sense of the intran- 
sitive verb increase. In our approach, learning 
the argument alternation (diathesis) relating the 
transitive increase (in its scalar motion sense) 
to the intransitive increase (in its scalar motion 
sense) amounts to learning representations with 
a common component R17 ∧ increase17. In this 
case, this is achieved. 
6 Conclusion 
We have proposed a procedure which maps 
observations of subcategorization frames with 
their complement fillers to structured lexical 
entries. We believe the method is scientifically 
interesting, practically useful, and flexible be- 
cause: 
1. The algorithms and implementation are ef- 
ficient enough to map a corpus of a hundred 
million words to a lexicon. 
2. The model and induction algorithm have 
foundations in the theory of parameter- 
ized families of probability distributions 
and statistical estimation. As exemplified 
in the paper, learning, disambiguation, and 
evaluation can be given simple, motivated 
formulations. 
3. The derived lexical representations are lin- 
guistically interpretable. This suggests the 
possibility of large-scale modeling and ob- 
servational experiments bearing on ques- 
tions arising in linguistic theories of the lex- 
icon. 
4. Because a simple probabilistic model is 
used, the induced lexical entries could be 
incorporated in lexicalized syntax-based 
probabilistic language models, in particular 
in head-lexicalized models. This provides 
for potential application in many areas. 
5. The method is applicable to any natural 
language where text samples of sufficient 
size, computational morphology, and a ro- 
bust parser capable of extracting subcate- 
gorization frames with their fillers are avail- 
able. 

References 
Leonard E. Baum, Ted Petrie, George Soules, 
and Norman Weiss. 1970. A maximiza- 
tion technique occurring in the statistical 
analysis of probabilistic functions of Markov 
chains. The Annals of Mathematical Statis- 
tics, 41(1):164-171. 
Franz Beil, Glenn Carroll, Detlef Prescher, Ste- 
fan Riezler, and Mats Rooth. 1999. Inside- 
outside estimation of a lexicalized PCFG for 
German. In Proceedings of the 37th Annual 
Meeting of the ACL, Maryland. 
Glenn Carroll and Mats Rooth. 1998. Valence 
induction with a head-lexicalized PCFG. In 
Proceedings of EMNLP-3, Granada. 
Ido Dagan, Lillian Lee, and Fernando Pereira. 
to appear. Similarity-based models of word 
cooccurrence probabilities. Machine Learning. 
A. P. Dempster, N. M. Laird, and D. B. Rubin. 
1977. Maximum likelihood from incomplete 
data via the EM algorithm. Journal of the 
Royal Statistical Society, 39(B):1-38. 
Jerry A. Fodor. 1998. Concepts: Where Cognitive Science Went Wrong. Oxford Cognitive Science Series, Oxford.
K. Hale and S.J. Keyser. 1993. Argument struc- 
ture and the lexical expression of syntactic re- 
lations. In K. Hale and S.J. Keyser, editors, 
The View from Building 20. MIT Press, Cam- 
bridge, MA. 
Thomas Hofmann and Jan Puzicha. 1998. Un- 
supervised learning from dyadic data. Tech- 
nical Report TR-98-042, International Computer Science Institute, Berkeley, CA.
Murat Kural. 1996. Verb Incorporation and El- 
ementary Predicates. Ph.D. thesis, University 
of California, Los Angeles. 
Beth Levin. 1993. English Verb Classes 
and Alternations. A Preliminary Investiga- 
tion. The University of Chicago Press, 
Chicago/London. 
Fernando Pereira, Naftali Tishby, and Lillian 
Lee. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the ACL, Columbus, Ohio.
Philip Resnik. 1993. Selection and information: 
A class-based approach to lexical relationships. 
Ph.D. thesis, University of Pennsylvania, CIS 
Department. 
Francesc Ribas. 1994. An experiment on learning appropriate selectional restrictions from a parsed corpus. In Proceedings of COLING-94, Kyoto, Japan.
Mats Rooth, Stefan Riezler, Detlef Prescher, 
Glenn Carroll, and Franz Beil. 1998. EM- 
based clustering for NLP applications. In 
Inducing Lexicons with the EM Algorithm. 
AIMS Report 4(3), Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
Mats Rooth. Ms. Two-dimensional clusters in 
grammatical relations. In Symposium on Rep- 
resentation and Acquisition of Lexical Knowl- 
edge: Polysemy, Ambiguity, and Generativity. 
AAAI 1995 Spring Symposium Series, Stan- 
ford University. 
Lawrence K. Saul and Fernando Pereira. 1997. 
Aggregate and mixed-order Markov models 
for statistical language processing. In Pro- 
ceedings of EMNLP-2. 
Helmut Schuhmacher. 1986. Verben in Feldern. Valenzwörterbuch zur Syntax und Semantik deutscher Verben. de Gruyter, Berlin.
