DISTRIBUTIONAL CLUSTERING OF ENGLISH WORDS 
Fernando Pereira 
AT&T Bell Laboratories 
600 Mountain Ave. 
Murray Hill, NJ 07974, USA 
pereira@research.att.com
Naftali Tishby 
Dept. of Computer Science 
Hebrew University 
Jerusalem 91904, Israel 
tishby@cs.huji.ac.il
Lillian Lee 
Dept. of Computer Science 
Cornell University 
Ithaca, NY 14850, USA 
llee@cs.cornell.edu
Abstract 
We describe and evaluate experimentally a 
method for clustering words according to their dis- 
tribution in particular syntactic contexts. Words 
are represented by the relative frequency distribu- 
tions of contexts in which they appear, and rela- 
tive entropy between those distributions is used as 
the similarity measure for clustering. Clusters are 
represented by average context distributions de- 
rived from the given words according to their prob- 
abilities of cluster membership. In many cases, 
the clusters can be thought of as encoding coarse 
sense distinctions. Deterministic annealing is used 
to find lowest distortion sets of clusters: as the an- 
nealing parameter increases, existing clusters be- 
come unstable and subdivide, yielding a hierarchi- 
cal "soft" clustering of the data. Clusters are used 
as the basis for class models of word cooccurrence,
and the models evaluated with respect to held-out 
test data. 
INTRODUCTION 
Methods for automatically classifying words ac- 
cording to their contexts of use have both scien- 
tific and practical interest. The scientific ques- 
tions arise in connection to distributional views 
of linguistic (particularly lexical) structure and 
also in relation to the question of lexical acqui- 
sition both from psychological and computational 
learning perspectives. From the practical point 
of view, word classification addresses questions of 
data sparseness and generalization in statistical 
language models, particularly models for deciding 
among alternative analyses proposed by a gram- 
mar. 
It is well known that a simple tabulation of fre- 
quencies of certain words participating in certain 
configurations, for example of frequencies of pairs 
of a transitive main verb and the head noun of its 
direct object, cannot be reliably used for compar- 
ing the likelihoods of different alternative configu- 
rations. The problem is that for large enough cor-
pora the number of possible joint events is much 
larger than the number of event occurrences in 
the corpus, so many events are seen rarely or 
never, making their frequency counts unreliable 
estimates of their probabilities. 
Hindle (1990) proposed dealing with the 
sparseness problem by estimating the likelihood of 
unseen events from that of "similar" events that 
have been seen. For instance, one may estimate 
the likelihood of a particular direct object for a 
verb from the likelihoods of that direct object for 
similar verbs. This requires a reasonable defini- 
tion of verb similarity and a similarity estimation 
method. In Hindle's proposal, words are similar if 
we have strong statistical evidence that they tend 
to participate in the same events. His notion of 
similarity seems to agree with our intuitions in 
many cases, but it is not clear how it can be used 
directly to construct word classes and correspond- 
ing models of association. 
Our research addresses some of the same ques- 
tions and uses similar raw data, but we investigate 
how to factor word association tendencies into as- 
sociations of words to certain hidden sense classes
and associations between the classes themselves. 
While it may be worth basing such a model on pre- 
existing sense classes (Resnik, 1992), in the work 
described here we look at how to derive the classes 
directly from distributional data. More specifi- 
cally, we model senses as probabilistic concepts 
or clusters c with corresponding cluster member- 
ship probabilities p(c|w) for each word w. Most
other class-based modeling techniques for natural 
language rely instead on "hard" Boolean classes 
(Brown et al., 1990). Class construction is then 
combinatorially very demanding and depends on 
frequency counts for joint events involving partic- 
ular words, a potentially unreliable source of in- 
formation as noted above. Our approach avoids 
both problems. 
Problem Setting 
In what follows, we will consider two major word 
classes, V and N, for the verbs and nouns in our
experiments, and a single relation between them, 
in our experiments the relation between a tran- 
sitive main verb and the head noun of its direct 
object. Our raw knowledge about the relation con- 
sists of the frequencies f_vn of occurrence of par-
ticular pairs (v,n) in the required configuration 
in a training corpus. Some form of text analy- 
sis is required to collect such pairs.
The corpus used in our first experiment was de- 
rived from newswire text automatically parsed by 
Hindle's parser Fidditch (Hindle, 1993). More re- 
cently, we have constructed similar tables with the 
help of a statistical part-of-speech tagger (Church, 
1988) and of tools for regular expression pattern 
matching on tagged corpora (Yarowsky, 1992). We 
have not yet compared the accuracy and cover- 
age of the two methods, or what systematic biases 
they might introduce, although we took care to fil- 
ter out certain systematic errors, for instance the 
misparsing of the subject of a complement clause 
as the direct object of a main verb for report verbs 
like "say". 
We will consider here only the problem of clas- 
sifying nouns according to their distribution as di- 
rect objects of verbs; the converse problem is for- 
mally similar. More generally, the theoretical ba- 
sis for our method supports the use of clustering 
to build models for any n-ary relation in terms of 
associations between elements in each coordinate 
and appropriate hidden units (cluster centroids) 
and associations between those hidden units.
For the noun classification problem, the empirical distribution of a noun n is then given by the conditional distribution

  p_n(v) = f_{vn} / \sum_{v'} f_{v'n} .

The problem we study is how to use the p_n to classify the n ∈ N. Our classification method will construct a set C of clusters and cluster membership probabilities p(c|n). Each cluster c is associated to a cluster centroid p_c, which is a distribution over V obtained by averaging appropriately the p_n.
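As a concrete illustration (a minimal sketch of ours, not code from the paper; names are illustrative), the p_n can be computed from a list of (verb, noun) pairs as follows:

    # Build the empirical conditional verb distributions
    # p_n(v) = f_vn / sum_v' f_v'n from (verb, noun) pairs.
    from collections import Counter, defaultdict

    def empirical_distributions(pairs):
        counts = defaultdict(Counter)          # counts[n][v] = f_vn
        for v, n in pairs:
            counts[n][v] += 1
        dists = {}
        for n, c in counts.items():
            total = sum(c.values())
            dists[n] = {v: f / total for v, f in c.items()}
        return dists

    pairs = [("fire", "missile"), ("fire", "gun"), ("launch", "missile")]
    p = empirical_distributions(pairs)
    # p["missile"] == {"fire": 0.5, "launch": 0.5}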
Distributional Similarity 
To cluster nouns n according to their conditional 
verb distributions p_n, we need a measure of similarity between distributions. We use for this purpose the relative entropy or Kullback-Leibler (KL) distance between two distributions

  D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} .
This is a natural choice for a variety of reasons, 
which we will just sketch here.¹
First of all, D(p || q) is zero just when p = q, and it increases as the probability decreases that p is the relative frequency distribution of a random sample drawn according to q. More formally, the probability mass given by q to the set of all samples of length n with relative frequency distribution p is bounded by \exp(-n D(p \| q)) (Cover
and Thomas, 1991). Therefore, if we are try- 
ing to distinguish among hypotheses qi when p is 
the relative frequency distribution of observations, D(p || q_i)
gives the relative weight of evidence in 
favor of qi. Furthermore, a similar relation holds 
between D(p || p') for two empirical distributions p
and p' and the probability that p and p' are drawn
from the same distribution q. We can thus use the 
relative entropy between the context distributions 
for two words to measure how likely they are to 
be instances of the same cluster centroid. 
¹A more formal discussion will appear in our paper Distributional Clustering, in preparation.
From an information theoretic perspective, D(p || q)
measures how inefficient on average it 
would be to use a code based on q to encode a 
variable distributed according to p. With respect 
to our problem, D(p_n || p_c) thus gives us the information loss in using cluster centroid p_c instead of the actual distribution p_n for word n when modeling the distributional properties of n.
Finally, relative entropy is a natural measure 
of similarity between distributions for clustering 
because its minimization leads to cluster centroids 
that are a simple weighted average of member dis- 
tributions. 
One technical difficulty is that D(p || p') is
not defined when p'(x) = 0 but p(x) > 0. We 
could sidestep this problem (as we did initially) by 
smoothing zero frequencies appropriately (Church 
and Gale, 1991). However, this is not very sat- 
isfactory because one of the goals of our work is 
precisely to avoid the problems of data sparseness 
by grouping words into classes. It turns out that 
the problem is avoided by our clustering technique, 
since it does not need to compute the KL distance 
between individual word distributions, but only 
between a word distribution and average distri- 
butions, the current cluster centroids, which are 
guaranteed to be nonzero whenever the word dis- 
tributions are. This is a useful advantage of our 
method compared with agglomerative clustering 
techniques that need to compare individual ob- 
jects being considered for grouping. 
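For illustration, a minimal Python sketch of the KL distance in bits (ours, not the authors' code), relying on the centroid-support property just described:

    import math

    def kl_divergence(p, q):
        # D(p || q) = sum_x p(x) log2(p(x) / q(x)), in bits.
        # Assumes q(x) > 0 wherever p(x) > 0, which holds when q is a
        # cluster centroid obtained by averaging distributions including p.
        return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)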
THEORETICAL BASIS 
In general, we are interested in how to organize 
a set of linguistic objects such as words according 
to the contexts in which they occur, for instance 
grammatical constructions or n-grams. We will 
show elsewhere that the theoretical analysis out- 
lined here applies to that more general problem, 
but for now we will only address the more specific 
problem in which the objects are nouns and the 
contexts are verbs that take the nouns as direct 
objects. 
Our problem can be seen as that of learning a 
joint distribution of pairs from a large sample of 
pairs. The pair coordinates come from two large sets N and V, with no preexisting internal structure, and the training data is a sequence S of N independently drawn pairs

  S_i = (n_i, v_i), \quad 1 \le i \le N .
From a learning perspective, this problem falls 
somewhere in between unsupervised and super- 
vised learning. As in unsupervised learning, the 
goal is to learn the underlying distribution of the 
data. But in contrast to most unsupervised learn- 
ing settings, the objects involved have no internal 
structure or attributes allowing them to be com- 
pared with each other. Instead, the only informa- 
tion about the objects is the statistics of their joint 
appearance. These statistics can thus be seen as a 
weak form of object labelling analogous to super- 
vision. 
Distributional Clustering 
While clusters based on distributional similarity 
are interesting on their own, they can also be prof- 
itably seen as a means of summarizing a joint dis- 
tribution. In particular, we would like to find a 
set of clusters C such that each conditional dis- 
tribution pn(v) can be approximately decomposed 
as 
  p_n(v) = \sum_{c \in C} p(c|n) \, p_c(v) ,

where p(c|n) is the membership probability of n in c and p_c(v) = p(v|c) is v's conditional probability given by the centroid distribution for cluster c.

The above decomposition can be written in a more symmetric form as

  \hat{p}(n, v) = \sum_{c \in C} p(c, n) \, p(v|c) = \sum_{c \in C} p(c) \, p(n|c) \, p(v|c)   (1)

assuming that p(n) and \hat{p}(n) coincide. We will
take (1) as our basic clustering model. 
To determine this decomposition we need to solve the two connected problems of finding suitable forms for the cluster membership p(c|n) and the centroid distributions p(v|c), and of maximizing the goodness of fit between the model distribution \hat{p}(n, v) and the observed data.
Goodness of fit is determined by the model's 
likelihood of the observations. The maximum like- 
lihood (ML) estimation principle is thus the nat- 
ural tool to determine the centroid distributions p_c(v).
As for the membership probabilities, they 
must be determined solely by the relevant mea- 
sure of object-to-cluster similarity, which in the 
present work is the relative entropy between ob- 
ject and cluster centroid distributions. Since no 
other information is available, the membership is 
determined by maximizing the configuration en- 
tropy for a fixed average distortion. With the max- 
imum entropy (ME) membership distribution, ML 
estimation is equivalent to the minimization of the 
average distortion of the data. The combined
entropy maximization and distortion min-
imization is carried out by a two-stage iterative 
process similar to the EM method (Dempster et 
al., 1977). The first stage of an iteration is a max- 
imum likelihood, or minimum distortion, estima- 
tion of the cluster centroids given fixed member- 
ship probabilities. In the second stage of each iter- 
ation, the entropy of the membership distribution 
is maximized for a fixed average distortion. This 
joint optimization searches for a saddle point in 
the distortion-entropy parameters, which is equiv- 
alent to minimizing a linear combination of the 
two known as free energy in statistical mechanics. 
This analogy with statistical mechanics is not co- 
incidental, and provides a better understanding of 
the clustering procedure. 
Maximum Likelihood Cluster 
Centroids 
For the maximum likelihood argument, we start by 
estimating the likelihood of the sequence S of N 
independent observations of pairs (ni,vi). Using 
(1), the sequence's model log likelihood is 
  l(S) = \sum_{i=1}^{N} \log \sum_{c \in C} p(c) \, p(n_i|c) \, p(v_i|c) .
Fixing the number of clusters (model size) |C|, we want to maximize l(S) with respect to the distributions p(n|c) and p(v|c). The variation of l(S) with respect to these distributions is

  \delta l(S) = \sum_{i=1}^{N} \frac{1}{\hat{p}(n_i, v_i)} \sum_{c \in C} p(c) \left[ \delta p(n_i|c) \, p(v_i|c) + p(n_i|c) \, \delta p(v_i|c) \right]   (2)
with p(n|c) and p(v|c) kept normalized. Using Bayes's formula, we have

  \frac{1}{\hat{p}(n_i, v_i)} = \frac{p(c|n_i, v_i)}{p(c) \, p(n_i|c) \, p(v_i|c)}   (3)
for any c.² Substituting (3) into (2), we obtain

  \delta l(S) = \sum_{i=1}^{N} \sum_{c \in C} p(c|n_i, v_i) \left[ \delta \log p(n_i|c) + \delta \log p(v_i|c) \right]   (4)

since \delta \log p = \delta p / p. This expression is particularly useful when the cluster distributions p(n|c) and p(v|c) have an exponential form, precisely
what will be provided by the ME step described 
below. 
At this point we need to specify the cluster- 
ing model in more detail. In the derivation so far 
we have treated p(n|c) and p(v|c) symmetrically,
corresponding to clusters not of verbs or nouns 
but of verb-noun associations. In principle such 
a symmetric model may be more accurate, but in 
this paper we will concentrate on asymmetric models
in which cluster memberships are associated to
just one of the components of the joint distribution 
and the cluster centroids are specified only by the 
other component. In particular, the model we use 
in our experiments has noun clusters with cluster 
memberships determined by p(n|c) and centroid
distributions determined by p(v|c).
The asymmetric model simplifies the estima- 
tion significantly by dealing with a single compo- 
nent, but it has the disadvantage that the joint 
distribution p(n, v) has two different and not nec-
essarily consistent expressions in terms of asym- 
metric models for the two coordinates. 
²As usual in clustering models (Duda and Hart, 1973), we assume that the model distribution and the empirical distribution are interchangeable at the solution of the parameter estimation equations, since the model is assumed to be able to represent correctly the data at that solution point. In practice, the data may not come exactly from the chosen model class, but the model obtained by solving the estimation equations may still be the closest one to the data.
Maximum Entropy Cluster Membership 
While variations of p(n|c) and p(v|c) in equation
(4) are not independent, we can treat them sep- 
arately. First, for fixed average distortion be- 
tween the cluster centroid distributions p(v|c) and the data p(v|n), we find the cluster membership probabilities, which are the Bayes inverses of the p(n|c),
that maximize the entropy of the cluster 
distributions. With the membership distributions 
thus obtained, we then look for the p(v|c) that maximize the log likelihood l(S). It turns out that this will also be the values of p(v|c) that mini-
mize the average distortion between the asymmet- 
ric cluster model and the data. 
Given any similarity measure d(n, c) between
nouns and cluster centroids, the average cluster 
distortion is

  \langle D \rangle = \sum_{n \in N} \sum_{c \in C} p(c|n) \, d(n, c)   (5)
If we maximize the cluster membership entropy

  H = - \sum_{n \in N} \sum_{c \in C} p(c|n) \log p(c|n)   (6)
subject to normalization of p(n|c) and fixed (5), we obtain the following standard exponential forms (Jaynes, 1983) for the class and membership distributions

  p(n|c) = \frac{1}{Z_c} \exp(-\beta \, d(n, c))   (7)

  p(c|n) = \frac{1}{Z_n} \exp(-\beta \, d(n, c))   (8)

where the normalization sums (partition functions) are Z_c = \sum_n \exp(-\beta \, d(n, c)) and Z_n = \sum_c \exp(-\beta \, d(n, c)). Notice that d(n, c) does not
need to be symmetric for this derivation, as the 
two distributions are simply related by Bayes's 
rule. 
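Computationally, equation (8) is a Boltzmann ("softmax") assignment over distortions. A minimal sketch, assuming the kl_divergence sketch above as the distortion d(n, c):

    import math

    def memberships(p_n, centroids, beta, d=kl_divergence):
        # p(c|n) = exp(-beta * d(n, c)) / Z_n,  equation (8).
        weights = [math.exp(-beta * d(p_n, p_c)) for p_c in centroids]
        z_n = sum(weights)                     # partition function Z_n
        return [w / z_n for w in weights]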
Returning to the log-likelihood variation (4), 
we can now use (7) for p(n|c) and the assumption
for the asymmetric model that the cluster mem- 
bership stays fixed as we adjust the centroids, to 
obtain 
  \delta l(S) = - \sum_{i=1}^{N} \sum_{c \in C} p(c|n_i) \left[ \beta \, \delta d(n_i, c) + \delta \log Z_c \right]   (9)

where the variation of p(v|c) is now included in the variation of d(n, c).
For a large enough sample, we may replace the sum over observations in (9) by the average over N

  \frac{1}{N} \delta l(S) = - \sum_{n \in N} p(n) \sum_{c \in C} p(c|n) \left[ \beta \, \delta d(n, c) + \delta \log Z_c \right]

which, applying Bayes's rule, becomes

  \frac{1}{N} \delta l(S) = - \sum_{c \in C} p(c) \left[ \beta \sum_{n \in N} p(n|c) \, \delta d(n, c) + \delta \log Z_c \right] .
At the log-likelihood maximum, this variation 
must vanish. We will see below that the use of rel- 
ative entropy as the similarity measure makes δ log Z_c
vanish at the maximum as well, so the log likeli- 
hood can be maximized by minimizing the average 
distortion with respect to the class centroids while 
class membership is kept fixed

  \sum_{c \in C} p(c) \sum_{n \in N} p(n|c) \, \delta d(n, c) = 0 ,

or, sufficiently, if each of the inner sums vanishes

  \sum_{n \in N} p(n|c) \, \delta d(n, c) = 0 .   (10)
Minimizing the Average KL Distortion We first show that the minimization of the relative entropy yields the natural expression for cluster centroids

  p(v|c) = \sum_{n \in N} p(n|c) \, p(v|n) .   (11)
To minimize the average distortion (10), we ob- 
serve that the variation of the KL distance be- 
tween noun and centroid distributions with re- 
spect to the centroid distribution p(v|c), with each centroid distribution normalized by the Lagrange multiplier λ_c, is given by

  \delta d(n, c) = \delta \left[ - \sum_{v \in V} p(v|n) \log p(v|c) + \lambda_c \left( \sum_{v \in V} p(v|c) - 1 \right) \right]
              = \sum_{v \in V} \left( - \frac{p(v|n)}{p(v|c)} + \lambda_c \right) \delta p(v|c) .
Substituting this expression into (10), we obtain

  \sum_{n \in N} p(n|c) \sum_{v \in V} \left( - \frac{p(v|n)}{p(v|c)} + \lambda_c \right) \delta p(v|c) = 0 .

Since the \delta p(v|c) are now independent, we obtain immediately the desired centroid expression (11), the weighted average of the noun distributions.
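In code, the update (11) is a weighted average of the noun distributions, with weights p(n|c) obtained from the memberships by Bayes's rule; a sketch consistent with the earlier ones (names are illustrative):

    from collections import defaultdict

    def update_centroid(noun_dists, p_n, member, c):
        # p(v|c) = sum_n p(n|c) p(v|n),  equation (11), where
        # p(n|c) = p(n) p(c|n) / p(c) by Bayes's rule.
        weights = {n: p_n[n] * member[n][c] for n in noun_dists}
        z = sum(weights.values())              # = p(c)
        centroid = defaultdict(float)
        for n, w in weights.items():
            for v, pv in noun_dists[n].items():
                centroid[v] += (w / z) * pv
        return dict(centroid)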
We can now see that the variation \delta \log Z_c vanishes for centroid distributions given by (11), since it follows from (10) that

  \delta \log Z_c = - \frac{\beta}{Z_c} \sum_{n} \exp(-\beta \, d(n, c)) \, \delta d(n, c) = - \beta \sum_{n} p(n|c) \, \delta d(n, c) = 0 .
The Free Energy Function The combined 
minimum distortion and maximum entropy opti- 
mization is equivalent to the minimization of a sin- 
gle function, the free energy

  F = - \frac{1}{\beta} \sum_{n} \log Z_n = \langle D \rangle - H/\beta ,

where ⟨D⟩ is the average distortion (5) and H is the cluster membership entropy (6).
186 
The free energy determines both the distor- 
tion and the membership entropy through 
  \langle D \rangle = \frac{\partial(\beta F)}{\partial \beta} , \qquad H = - \frac{\partial F}{\partial T} ,

where T = \beta^{-1} is the temperature.
The most important property of the free en- 
ergy is that its minimum determines the balance 
between the "disordering" maximum entropy and 
"ordering" distortion minimization in which the 
system is most likely to be found. In fact the prob- 
ability to find the system at a given configuration 
is exponential in F 
  P \propto \exp(-\beta F) ,
so a system is most likely to be found in its mini- 
mal free energy configuration. 
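For completeness, a sketch of F under the same conventions as the earlier sketches (ours, not the paper's implementation):

    import math

    def free_energy(noun_dists, centroids, beta, d=kl_divergence):
        # F = -(1/beta) sum_n log Z_n, with Z_n = sum_c exp(-beta d(n, c)).
        total = 0.0
        for p_n in noun_dists.values():
            z_n = sum(math.exp(-beta * d(p_n, p_c)) for p_c in centroids)
            total += math.log(z_n)
        return -total / beta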
Hierarchical Clustering 
The analogy with statistical mechanics suggests 
a deterministic annealing procedure for clustering (Rose et al., 1990), in which the number of clusters is determined through a sequence of phase transitions by continuously increasing the parameter β following an annealing schedule.
The higher β is, the more local the influence of each noun on the definition of centroids. Distributional similarity plays here the role of distortion. When the scale parameter β is close to zero,
the similarity is almost irrelevant. All words con- 
tribute about equally to each centroid, and so the 
lowest average distortion solution involves just one 
cluster whose centroid is the average of all word 
distributions. As β is slowly increased, a critical point is eventually reached for which the lowest F solution involves two distinct centroids. We say
then that the original cluster has split into the two 
new clusters. 
In general, if we take any cluster c and a twin c' of c such that the centroid p_{c'} is a small random perturbation of p_c, then below the critical β at which c splits, the membership and centroid reestimation procedure given by equations (8) and (11) will make p_c and p_{c'} converge; that is, c and c' are really the same cluster. But with β above the critical value for c, the two centroids will diverge, giving rise to two daughters of c.
Our clustering procedure is thus as follows. 
We start with very low β and a single cluster
whose centroid is the average of all noun distri- 
butions. For any given fl, we have a current set of 
leaf clusters corresponding to the current free en- 
ergy (local) minimum. To refine such a solution, 
we search for the lowest β which is the critical
value at which some current leaf cluster splits. Ideally,
there is just one split at that critical value, but 
for practical performance and numerical accuracy 
reasons we may have several splits at the new crit- 
ical point. The splitting procedure can then be 
repeated to achieve the desired number of clusters 
or model cross-entropy. 
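The following sketch composes the earlier ones (kl_divergence, memberships, and update_centroid are assumed in scope); the perturbation size, annealing rate, and tolerance are illustrative choices, not the values used in the experiments reported here:

    import random

    def perturb(p_c, eps=1e-3):
        # Small random perturbation of a centroid, renormalized.
        noisy = {v: p * (1.0 + eps * random.random()) for v, p in p_c.items()}
        z = sum(noisy.values())
        return {v: p / z for v, p in noisy.items()}

    def anneal(noun_dists, p_n, beta=0.5, rate=1.2, steps=20, em_iters=10):
        # Start with a single cluster: the average of all noun distributions.
        centroids = [update_centroid(noun_dists, p_n,
                                     {n: [1.0] for n in noun_dists}, 0)]
        for _ in range(steps):
            # Pair every centroid with a perturbed twin; below the critical
            # beta the twins reconverge, above it they split.
            centroids = centroids + [perturb(p_c) for p_c in centroids]
            for _ in range(em_iters):
                member = {n: memberships(pn, centroids, beta)
                          for n, pn in noun_dists.items()}
                centroids = [update_centroid(noun_dists, p_n, member, c)
                             for c in range(len(centroids))]
            # Prune twins that reconverged to (numerically) the same centroid.
            kept = []
            for p_c in centroids:
                if all(kl_divergence(p_c, q) + kl_divergence(q, p_c) > 1e-4
                       for q in kept):
                    kept.append(p_c)
            centroids = kept
            beta *= rate                        # annealing schedule
        return centroids, beta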
  cluster 1: missile 0.835, rocket 0.850, bullet 0.917, … 0.940
  cluster 2: officer 0.484, aide 0.612, chief 0.649, manager 0.651
  cluster 3: gun 0.758, missile 0.786, weapon 0.862, rocket 0.875
  cluster 4: shot 0.858, bullet 0.925, rocket 0.930, missile 1.037

Figure 1: Direct object clusters for "fire" (the four words closest to each cluster centroid, with word-centroid KL distances; clusters 1 and 2 result from the first split of the root, clusters 3 and 4 from the split of cluster 1)
CLUSTERING EXAMPLES 
All our experiments involve the asymmetric model 
described in the previous section. As explained 
there, our clustering procedure yields for each value of β a set C_β of clusters minimizing the free energy F, and the asymmetric model for β estimates the conditional verb distribution for a noun n by

  \hat{p}_n(v) = \sum_{c \in C_\beta} p(c|n) \, p_c(v) ,

where p(c|n) also depends on β.
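In code, this estimate is a single mixture sum (a sketch consistent with the earlier ones):

    def model_estimate(member_n, centroids, v):
        # \hat{p}_n(v) = sum_c p(c|n) p_c(v); member_n[c] = p(c|n).
        return sum(m * p_c.get(v, 0.0) for m, p_c in zip(member_n, centroids))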
As a first experiment, we used our method to 
classify the 64 nouns appearing most frequently 
as heads of direct objects of the verb "fire" in one 
year (1988) of Associated Press newswire. In this 
corpus, the chosen nouns appear as direct object 
heads of a total of 2147 distinct verbs, so each 
noun is represented by a density over the 2147 
verbs. 
Figure 1 shows the four words most similar to 
each cluster centroid, and the corresponding word- 
centroid KL distances, for the four clusters result- 
ing from the first two cluster splits. It can be seen
that the first split separates the objects corresponding
to the weaponry sense of "fire" (cluster 1) from the 
ones corresponding to the personnel action (clus- 
ter 2). The second split then further refines the 
weaponry sense into a projectile sense (cluster 3) 
and a gun sense (cluster 4). That split is some- 
what less sharp, possibly because not enough dis- 
tinguishing contexts occur in the corpus. 
Figure 2 shows the four closest nouns to the 
centroid of each of a set of hierarchical clus- 
ters derived from verb-object pairs involving the 
1000 most frequent nouns in the June 1991 elec- 
tronic version of Grolier's Encyclopedia (10 million words).
[Figure 2: Noun Clusters for Grolier's Encyclopedia (the four nouns closest to the centroid of each cluster in the hierarchy, with word-centroid KL distances)]
[Figure 3: Asymmetric Model Evaluation, AP88 Verb-Direct Object Pairs — average relative entropy versus number of clusters for the train, test, and new sets]
[Figure 4: Pairwise Verb Comparisons, AP88 Verb-Direct Object Pairs — decision error rate versus number of clusters, for all triples and for exceptional triples]
MODEL EVALUATION 
The preceding qualitative discussion provides 
some indication of what aspects of distributional 
relationships may be discovered by clustering. 
However, we also need to evaluate clustering more 
rigorously as a basis for models of distributional 
relationships. So far, we have looked at two kinds
of measurements of model quality: (i) relative en- 
tropy between held-out data and the asymmetric 
model, and (ii) performance on the task of decid- 
ing which of two verbs is more likely to take a given 
noun as direct object when the data relating one 
of the verbs to the noun has been withheld from 
the training data. 
The evaluation described below was per- 
formed on the largest data set we have worked 
with so far, extracted from 44 million words of 
1988 Associated Press newswire with the pattern 
matching techniques mentioned earlier. This col- 
lection process yielded 1112041 verb-object pairs. 
We then selected the subset involving the 1000
most frequent nouns in the corpus for clustering, 
and randomly divided it into a training set of 
756721 pairs and a test set of 81240 pairs. 
Relative Entropy 
Figure 3 plots the unweighted average relative entropy, in bits, of several test sets to asymmetric clustered models of different sizes, given by

  \frac{1}{|N_t|} \sum_{n \in N_t} D(t_n \| \hat{p}_n) ,

where N_t is the set of direct objects in the test set and t_n is the relative frequency distribution of verbs taking n as direct object in the test set.³ For each critical value of β, we show the relative entropy with respect to
³We use unweighted averages because we are interested here in how well the noun distributions are approximated by the cluster model. If we were interested in the total information loss of using the asymmetric model to encode a test corpus, we would instead use the weighted average \sum_{n \in N_t} f_n D(t_n \| \hat{p}_n), where f_n is the relative frequency of n in the test set.
the asymmetric model based on C_β of the train-
ing set (set train), of a randomly selected held-out
test set (set test), and of held-out data for a fur- 
ther 1000 nouns that were not clustered (set new). 
Unsurprisingly, the training set relative entropy 
decreases monotonically. The test set relative en- 
tropy decreases to a minimum at 206 clusters, and 
then starts increasing, suggesting that larger mod- 
els are overtrained. 
The new noun test set is intended to test 
whether clusters based on the 1000 most frequent 
nouns are useful classifiers for the selectional prop- 
erties of nouns in general. Since the nouns in the 
test set pairs do not occur in the training set, we 
do not have their cluster membership probabilities 
that are needed in the asymmetric model. Instead, 
for each noun n in the test set, we classify it with 
respect to the clusters by setting 
  p(c|n) = \exp(-\beta D(p_n \| p_c)) / Z_n ,

where p_n is the empirical conditional verb distri-
bution for n given by the test set. These cluster 
membership estimates were then used in the asym- 
metric model and the test set relative entropy cal- 
culated as before. As the figure shows, the cluster 
model provides over one bit of information about 
the selectional properties of the new nouns, but 
the overtraining effect is even sharper than for the 
held-out data involving the 1000 clustered nouns. 
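A sketch of this evaluation in terms of the earlier sketches; it assumes every verb in a test distribution also occurs in training, so the model estimate is nonzero wherever needed (otherwise smoothing would be required):

    import math

    def avg_relative_entropy(test_dists, centroids, beta):
        # Unweighted average, in bits, of D(t_n || \hat{p}_n) over test
        # nouns, with memberships set from t_n itself via
        # p(c|n) = exp(-beta D(t_n || p_c)) / Z_n.
        total = 0.0
        for t_n in test_dists.values():
            member = memberships(t_n, centroids, beta)
            total += sum(p * math.log2(p / model_estimate(member, centroids, v))
                         for v, p in t_n.items() if p > 0)
        return total / len(test_dists)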
Decision Task 
We also evaluated asymmetric cluster models on 
a verb decision task closer to possible applications 
to disambiguation in language analysis. The task 
consists of judging which of two verbs v and v' is
more likely to take a given noun n as object, when 
all occurrences of (v, n) in the training set were 
deliberately deleted. Thus this test evaluates how 
well the models reconstruct missing data in the 
verb distribution for n from the cluster centroids 
close to n. 
The data for this test was built from the train- 
ing data for the previous one in the following way, 
based on a suggestion by Dagan et al. (1993). 104 
noun-verb pairs with a fairly frequent verb (be- 
tween 500 and 5000 occurrences) were randomly 
picked, and all occurrences of each pair in the 
training set were deleted. The resulting training 
set was used to build a sequence of cluster models 
as before. Each model was used to decide which of
two verbs v and v' is more likely to appear with
a noun n where the (v, n) data was deleted from 
the training set, and the decisions were compared 
with the corresponding ones derived from the orig- 
inal event frequencies in the initial data set. The 
error rate for each model is simply the proportion 
of disagreements for the selected (v, n, v') triples.
Figure 4 shows the error rates for each model for 
all the selected (v, n, v') (all) and for just those exceptional triples in which the conditional ratio p(n, v)/p(n, v') is on the opposite side of 1 from the marginal ratio p(v)/p(v'). In other words, the
exceptional cases are those in which predictions 
based just on the marginal frequencies, which the 
initial one-cluster model represents, would be con- 
sistently wrong. 
Here too we see some overtraining for the 
largest models considered, although not for the ex- 
ceptional verbs. 
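A sketch of the scoring in terms of the earlier sketches (member maps each noun to its membership vector; freq holds the original, pre-deletion pair frequencies):

    def decision_error(triples, freq, member, centroids):
        # A triple (v, n, v2) counts as an error when the model's preference
        # between v and v2 for noun n disagrees with the original frequencies.
        wrong = 0
        for v, n, v2 in triples:
            model_prefers_v = (model_estimate(member[n], centroids, v)
                               > model_estimate(member[n], centroids, v2))
            data_prefers_v = freq.get((v, n), 0) > freq.get((v2, n), 0)
            wrong += (model_prefers_v != data_prefers_v)
        return wrong / len(triples)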
CONCLUSIONS 
We have demonstrated that a general divisive clus- 
tering procedure for probability distributions can 
be used to group words according to their partic- 
ipation in particular grammatical relations with 
other words. The resulting clusters are intuitively 
informative, and can be used to construct class- 
based word cooccurrence models with substantial
predictive power. 
While the clusters derived by the proposed 
method seem in many cases semantically signif- 
icant, this intuition needs to be grounded in a 
more rigorous assessment. In addition to predic- 
tive power evaluations of the kind we have al- 
ready carried out, it might be worth comparing 
automatically-derived clusters with human judgements in a suitable experimental setting.
Moving further in the direction of class-based 
language models, we plan to consider additional 
distributional relations (for instance, adjective- 
noun) and apply the results of clustering to 
the grouping of lexical associations in lexicalized 
grammar frameworks such as stochastic lexicalized 
tree-adjoining grammars (Schabes, 1992). 
ACKNOWLEDGMENTS 
We would like to thank Don Hindle for making 
available the 1988 Associated Press verb-object 
data set, the Fidditch parser and a verb-object 
structure filter, Mats Rooth for selecting the ob- 
jects of "fire" data set and many discussions, 
David Yarowsky for help with his stemming and 
concordancing tools, and Ido Dagan for suggesting
ways of testing cluster models. 

REFERENCES 
Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jenifer C. Lai, and Robert L. Mercer. 1990.
Class-based n-gram models of natural language. 
In Proceedings of the IBM Natural Language ITL, 
pages 283-298, Paris, France, March. 
Kenneth W. Church and William A. Gale. 1991. 
A comparison of the enhanced Good-Turing and 
deleted estimation methods for estimating proba- 
bilities of English bigrams. Computer Speech and 
Language, 5:19-54. 
Kenneth W. Church. 1988. A stochastic parts pro- 
gram and noun phrase parser for unrestricted 
text. In Proceedings of the Second Conference 
on Applied Natural Language Processing, pages 
136-143, Austin, Texas. Association for Compu- 
tational Linguistics, Morristown, New Jersey. 
Thomas M. Cover and Joy A. Thomas. 1991. Ele- 
ments of Information Theory. Wiley-Interscience, 
New York, New York. 
Ido Dagan, Shaul Marcus, and Shaul Markovitch.
1993. Contextual word similarity and estimation 
from sparse data. In these proceedings. 
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. 
Maximum likelihood from incomplete data via the 
EM algorithm. Journal of the Royal Statistical 
Society, Series B, 39(1):1-38. 
Richard O. Duda and Peter E. Hart. 1973. Pat- 
tern Classification and Scene Analysis. Wiley-Interscience, New York, New York.
Donald Hindle. 1990. Noun classification from 
predicate-argument structures. In 28th Annual 
Meeting of the Association for Computational 
Linguistics, pages 268-275, Pittsburgh, Pennsyl- 
vania. Association for Computational Linguistics, 
Morristown, New Jersey. 
Donald Hindle. 1993. A parser for text corpora. In 
B.T.S. Atkins and A. Zampolli, editors, Computational Approaches to the Lexicon.
Oxford Univer- 
sity Press, Oxford, England. To appear. 
Edwin T. Jaynes. 1983. Brandeis lectures. In 
Roger D. Rosenkrantz, editor, E. T. Jaynes: Papers on Probability, Statistics and Statistical 
Physics, number 158 in Synthese Library, chap- 
ter 4, pages 40-76. D. Reidel, Dordrecht, Holland. 
Philip Resnik. 1992. WordNet and distributional 
analysis: A class-based approach to lexical dis- 
covery. In AAAI Workshop on Statistically- 
Based Natural-Language-Processing Techniques, 
San Jose, California, July. 
Kenneth Rose, Eitan Gurewitz, and Geoffrey C. Fox. 
1990. Statistical mechanics and phase transitions 
in clustering. Physical Review Letters, 65(8):945- 
948. 
Yves Schabes. 1992. Stochastic lexicalized tree-adjoining grammars. In Proceedings of the 14th
International Conference on Computational Lin- 
guistics, Nantes, France. 
David Yarowsky. 1992. CONC: Tools for text corpora. 
Technical Memorandum 11222-921222-29, AT&T 
Bell Laboratories. 
