<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1024">
  <Title>DISTRIBUTIONAL CLUSTERING OF ENGLISH WORDS</Title>
  <Section position="4" start_page="183" end_page="186" type="metho">
    <SectionTitle>
THEORETICAL BASIS
</SectionTitle>
    <Paragraph position="0"> In general, we are interested in how to organize a set of linguistic objects such as words according to the contexts in which they occur, for instance grammatical constructions or n-grams. We will show elsewhere that the theoretical analysis outlined here applies to that more general problem, but for now we will only address the more specific problem in which the objects are nouns and the contexts are verbs that take the nouns as direct objects.</Paragraph>
    <Paragraph position="1"> Our problem can be seen as that of learning a joint distribution of pairs from a large sample of pairs. The pair coordinates come from two large sets 𝒩 and 𝒱, with no preexisting internal structure, and the training data is a sequence S of N independently drawn pairs S_i = (n_i, v_i), 1 \le i \le N.</Paragraph>
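    As an illustration of this setup, the following minimal sketch (with a made-up toy sample and hypothetical variable names, not data from the paper) builds the empirical joint distribution and the per-noun conditional verb distributions from a list of (noun, verb) direct-object pairs:

        from collections import Counter

        # Hypothetical training sample S of (noun, verb) direct-object pairs.
        pairs = [("gun", "fire"), ("missile", "fire"), ("employee", "fire"),
                 ("gun", "aim"), ("missile", "launch"), ("employee", "hire")]

        N = len(pairs)
        pair_counts = Counter(pairs)
        noun_counts = Counter(n for n, _ in pairs)

        # Empirical joint distribution p(n, v) and conditional p(v | n).
        p_joint = {nv: c / N for nv, c in pair_counts.items()}
        p_v_given_n = {(n, v): c / noun_counts[n] for (n, v), c in pair_counts.items()}

        print(p_v_given_n[("gun", "fire")])   # 0.5 for this toy sample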
    <Paragraph position="2"> From a learning perspective, this problem falls somewhere in between unsupervised and supervised learning. As in unsupervised learning, the goal is to learn the underlying distribution of the data. But in contrast to most unsupervised learning settings, the objects involved have no internal structure or attributes allowing them to be compared with each other. Instead, the only information about the objects is the statistics of their joint appearance. These statistics can thus be seen as a weak form of object labelling analogous to supervision.</Paragraph>
    <Section position="1" start_page="184" end_page="184" type="sub_section">
      <SectionTitle>
Distributional Clustering
</SectionTitle>
      <Paragraph position="0"> While clusters based on distributional similarity are interesting on their own, they can also be profitably seen as a means of summarizing a joint distribution. In particular, we would like to find a set of clusters C such that each conditional distribution p_n(v) can be approximately decomposed as</Paragraph>
      <Paragraph position="1"> \hat{p}_n(v) = \sum_{c \in C} p(c|n) \, p_c(v) , \qquad (1)</Paragraph>
      <Paragraph position="2"> where p(c|n) is the membership probability of n in c and p_c(v) = p(v|c) is v's conditional probability given by the centroid distribution for cluster c.</Paragraph>
      <Paragraph position="3"> The above decomposition can be written in a more symmetric form as</Paragraph>
      <Paragraph position="4"> \hat{p}(n, v) = \sum_{c \in C} p(c) \, p(n|c) \, p(v|c) = p(n) \sum_{c \in C} p(c|n) \, p_c(v) , \qquad (2)</Paragraph>
      <Paragraph position="5"> assuming that p(n) and \hat{p}(n) coincide. We will take (1) as our basic clustering model.</Paragraph>
      <Paragraph position="6"> To determine this decomposition we need to solve the two connected problems of finding suitable forms for the cluster membership p(c|n) and the centroid distributions p(v|c), and of maximizing the goodness of fit between the model distribution \hat{p}(n, v) and the observed data.</Paragraph>
      <Paragraph position="7"> Goodness of fit is determined by the model's likelihood of the observations. The maximum likelihood (ML) estimation principle is thus the natural tool to determine the centroid distributions p_c(v). As for the membership probabilities, they must be determined solely by the relevant measure of object-to-cluster similarity, which in the present work is the relative entropy between object and cluster centroid distributions. Since no other information is available, the membership is determined by maximizing the configuration entropy for a fixed average distortion. With the maximum entropy (ME) membership distribution, ML estimation is equivalent to the minimization of the average distortion of the data. The combined entropy maximization and distortion minimization is carried out by a two-stage iterative process similar to the EM method (Dempster et al., 1977). The first stage of an iteration is a maximum likelihood, or minimum distortion, estimation of the cluster centroids given fixed membership probabilities. In the second stage of each iteration, the entropy of the membership distribution is maximized for a fixed average distortion. This joint optimization searches for a saddle point in the distortion-entropy parameters, which is equivalent to minimizing a linear combination of the two known as free energy in statistical mechanics.</Paragraph>
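      A minimal sketch of this two-stage iteration, assuming the distortion d(n, c) is the KL divergence between noun and centroid distributions and that memberships take the exponential form derived below (all function and variable names are illustrative, not the paper's implementation):

          import numpy as np

          def kl_matrix(p, q, eps=1e-12):
              # Pairwise KL divergences D(p_n || q_c): rows of p against rows of q.
              return np.sum(p[:, None, :] * np.log((p[:, None, :] + eps) / (q[None, :, :] + eps)), axis=2)

          def cluster(p_v_given_n, p_n, n_clusters, beta, n_iter=50, seed=0):
              # p_v_given_n: |N| x |V| matrix of conditional verb distributions p(v|n).
              # p_n: empirical noun probabilities.  Returns memberships p(c|n) and centroids p(v|c).
              rng = np.random.default_rng(seed)
              base = p_n @ p_v_given_n                           # average of all noun distributions
              centroids = base + 0.01 * rng.random((n_clusters, base.size))
              centroids /= centroids.sum(axis=1, keepdims=True)
              for _ in range(n_iter):
                  # ME step: maximum-entropy memberships for fixed centroids.
                  d = kl_matrix(p_v_given_n, centroids)          # |N| x |C| distortions
                  w = np.exp(-beta * d)
                  p_c_given_n = w / w.sum(axis=1, keepdims=True)
                  # ML step: minimum-distortion centroids for fixed memberships,
                  # the weighted average p(v|c) = sum_n p(n|c) p(v|n).
                  p_nc = p_c_given_n * p_n[:, None]
                  p_n_given_c = p_nc / p_nc.sum(axis=0, keepdims=True)
                  centroids = p_n_given_c.T @ p_v_given_n
              return p_c_given_n, centroids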
      <Paragraph position="8"> This analogy with statistical mechanics is not coincidental, and provides a better understanding of the clustering procedure.</Paragraph>
    </Section>
    <Section position="2" start_page="184" end_page="185" type="sub_section">
      <SectionTitle>
Maximum Likelihood Cluster
Centroids
</SectionTitle>
      <Paragraph position="0"> For the maximum likelihood argument, we start by estimating the likelihood of the sequence S of N independent observations of pairs (n_i, v_i). Using (1), the sequence's model log likelihood is</Paragraph>
      <Paragraph position="1"> l(S) = \sum_{i=1}^{N} \log \sum_{c \in C} p(c) \, p(n_i|c) \, p(v_i|c) . \qquad (3)</Paragraph>
      <Paragraph position="2"> Fixing the number of clusters (model size) |C|, we want to maximize l(S) with respect to the distributions p(n|c) and p(v|c). The variation of l(S) with respect to these distributions is</Paragraph>
      <Paragraph position="3"> \delta l(S) = \sum_{i=1}^{N} \frac{1}{\hat{p}(n_i, v_i)} \sum_{c \in C} p(c) \left[ p(v_i|c) \, \delta p(n_i|c) + p(n_i|c) \, \delta p(v_i|c) \right] , \qquad (4)</Paragraph>
      <Paragraph position="4"> since \delta \log p = \delta p / p. This expression is particularly useful when the cluster distributions p(n|c) and p(v|c) have an exponential form, precisely what will be provided by the ME step described below.</Paragraph>
      <Paragraph position="5"> At this point we need to specify the clustering model in more detail. In the derivation so far we have treated p(n|c) and p(v|c) symmetrically, corresponding to clusters not of verbs or nouns but of verb-noun associations. In principle such a symmetric model may be more accurate, but in this paper we will concentrate on asymmetric models in which cluster memberships are associated to just one of the components of the joint distribution and the cluster centroids are specified only by the other component. In particular, the model we use in our experiments has noun clusters with cluster memberships determined by p(n|c) and centroid distributions determined by p(v|c).</Paragraph>
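      For concreteness, a small sketch of how the asymmetric model turns memberships and centroids into a predicted conditional verb distribution for a noun (function and argument names are hypothetical, matching the arrays produced by the iteration sketched above):

          import numpy as np

          def predict_verb_distribution(noun_index, p_c_given_n, centroids):
              # Asymmetric model estimate p_hat_n(v) = sum_c p(c|n) p(v|c).
              # p_c_given_n: |N| x |C| membership matrix; centroids: |C| x |V| matrix.
              return p_c_given_n[noun_index] @ centroids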
      <Paragraph position="6"> The asymmetric model simplifies the estimation significantly by dealing with a single component, but it has the disadvantage that the joint distribution, p(n, v), has two different and not necessarily consistent expressions in terms of asymmetric models for the two coordinates.</Paragraph>
      <Paragraph position="7"> As usual in clustering models (Duda and Hart, 1973), we assume that the model distribution and the empirical distribution are interchangeable at the solution of the parameter estimation equations, since the model is assumed to be able to represent the data correctly at that solution point. In practice, the data may not come exactly from the chosen model class, but the model obtained by solving the estimation equations may still be the closest one to the data.</Paragraph>
    </Section>
    <Section position="3" start_page="185" end_page="186" type="sub_section">
      <SectionTitle>
Maximum Entropy Cluster Membership
</SectionTitle>
      <Paragraph position="0"> While variations of p(n|c) and p(v|c) in equation (4) are not independent, we can treat them separately. First, for fixed average distortion between the cluster centroid distributions p(v|c) and the data p(v|n), we find the cluster membership probabilities, which are the Bayes inverses of the p(n|c), that maximize the entropy of the cluster distributions. With the membership distributions thus obtained, we then look for the p(v|c) that maximize the log likelihood l(S). It turns out that these will also be the values of p(v|c) that minimize the average distortion between the asymmetric cluster model and the data.</Paragraph>
      <Paragraph position="1"> Given any similarity measure d(n, c) between nouns and cluster centroids, the average cluster distortion is</Paragraph>
      <Paragraph position="2"> \langle D \rangle = \sum_{n} \sum_{c \in C} p(c|n) \, d(n, c) . \qquad (5)</Paragraph>
      <Paragraph position="3"> If we maximize the cluster membership entropy</Paragraph>
      <Paragraph position="4"> H = - \sum_{n} \sum_{c \in C} p(c|n) \log p(c|n) \qquad (6)</Paragraph>
      <Paragraph position="5"> subject to normalization of p(n|c) and fixed average distortion (5), we obtain the following standard exponential forms (Jaynes, 1983) for the class and membership distributions</Paragraph>
      <Paragraph position="6"> p(n|c) = \frac{1}{Z_c} \exp\left(-\beta \, d(n, c)\right) \qquad (7)</Paragraph>
      <Paragraph position="7"> p(c|n) = \frac{1}{Z_n} \exp\left(-\beta \, d(n, c)\right) \qquad (8)</Paragraph>
      <Paragraph position="8"> where the normalization sums (partition functions) are Z_c = \sum_n \exp(-\beta \, d(n, c)) and Z_n = \sum_c \exp(-\beta \, d(n, c)). Notice that d(n, c) does not need to be symmetric for this derivation, as the two distributions are simply related by Bayes's rule.</Paragraph>
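      A small numerical sketch of these two exponential forms, computed from the same matrix of distortions with the two different normalizations (the distortion values below are made up purely for illustration):

          import numpy as np

          beta = 2.0
          # Hypothetical distortion matrix d(n, c): 3 nouns x 2 clusters.
          d = np.array([[0.1, 1.2],
                        [0.9, 0.3],
                        [0.5, 0.6]])

          w = np.exp(-beta * d)
          p_n_given_c = w / w.sum(axis=0, keepdims=True)  # form (7): normalize over nouns (Z_c)
          p_c_given_n = w / w.sum(axis=1, keepdims=True)  # form (8): normalize over clusters (Z_n)

          print(p_n_given_c.sum(axis=0))  # each column sums to 1
          print(p_c_given_n.sum(axis=1))  # each row sums to 1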
      <Paragraph position="8"> Returning to the log-likelihood variation (4), we can now use (7) for p(n|c) and the assumption for the asymmetric model that the cluster membership stays fixed as we adjust the centroids, to obtain</Paragraph>
      <Paragraph position="9"> \delta l(S) = - \sum_{i=1}^{N} \sum_{c \in C} p(c|n_i) \left[ \beta \, \delta d(n_i, c) + \delta \log Z_c \right] , \qquad (9)</Paragraph>
      <Paragraph position="10"> where the variation of p(v|c) is now included in the variation of d(n, c).</Paragraph>
      <Paragraph position="11"> For a large enough sample, we may replace the sum over observations in (9) by the average over 𝒩,</Paragraph>
      <Paragraph position="12"> \delta l(S) / N = - \sum_{n} p(n) \sum_{c \in C} p(c|n) \left[ \beta \, \delta d(n, c) + \delta \log Z_c \right] . </Paragraph>
      <Paragraph position="13"> At the log-likelihood maximum, this variation must vanish. We will see below that the use of relative entropy for the similarity measure makes δ log Z_c vanish at the maximum as well, so the log likelihood can be maximized by minimizing the average distortion with respect to the class centroids while class membership is kept fixed,</Paragraph>
      <Paragraph position="14"> \sum_{c \in C} p(c) \sum_{n} p(n|c) \, \delta d(n, c) = 0 , </Paragraph>
      <Paragraph position="15"> or, sufficiently, if each of the inner sums vanishes,</Paragraph>
      <Paragraph position="16"> \sum_{n} p(n|c) \, \delta d(n, c) = 0 \quad \text{for each } c \in C . \qquad (10)</Paragraph>
      <Paragraph position="17"> Minimizing the Average KL Distortion.  We first show that the minimization of the relative entropy yields the natural expression for cluster centroids,</Paragraph>
      <Paragraph position="18"> p(v|c) = \sum_{n} p(n|c) \, p(v|n) . \qquad (11)</Paragraph>
      <Paragraph position="19"> To minimize the average distortion (10), we observe that the variation of the KL distance between noun and centroid distributions with respect to the centroid distribution p(v|c), with each centroid distribution normalized by the Lagrange multiplier λ_c, is</Paragraph>
      <Paragraph position="20"> \delta \left[ \sum_{v} p(v|n) \log \frac{p(v|n)}{p(v|c)} + \lambda_c \left( \sum_{v} p(v|c) - 1 \right) \right] = \sum_{v} \left[ \lambda_c - \frac{p(v|n)}{p(v|c)} \right] \delta p(v|c) . </Paragraph>
      <Paragraph position="21"> Substituting this expression into (10), we obtain \sum_{n} p(n|c) \sum_{v} [ \lambda_c - p(v|n)/p(v|c) ] \, \delta p(v|c) = 0. Since the \delta p(v|c) are now independent, we obtain immediately the desired centroid expression (11), the weighted average of noun distributions. We can now see that the variation δ log Z_c vanishes for centroid distributions given by (11), since it follows from (10) that</Paragraph>
      <Paragraph position="22"> \delta \log Z_c = - \frac{\beta}{Z_c} \sum_{n} \exp\left(-\beta \, d(n, c)\right) \delta d(n, c) = - \beta \sum_{n} p(n|c) \, \delta d(n, c) = 0 . </Paragraph>
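      The fact that the weighted average (11) minimizes the average KL distortion for fixed memberships can also be checked numerically; a small sketch with made-up distributions and weights (all names hypothetical):

          import numpy as np

          rng = np.random.default_rng(0)

          def avg_kl(p_v_given_n, weights, centroid, eps=1e-12):
              # Average KL distortion sum_n p(n|c) D(p_n || centroid).
              kl = np.sum(p_v_given_n * np.log((p_v_given_n + eps) / (centroid + eps)), axis=1)
              return np.dot(weights, kl)

          # Hypothetical noun distributions over 5 verbs, and memberships p(n|c).
          p_v_given_n = rng.dirichlet(np.ones(5), size=4)
          weights = rng.dirichlet(np.ones(4))

          centroid = weights @ p_v_given_n          # equation (11): weighted average
          best = avg_kl(p_v_given_n, weights, centroid)

          # Any other normalized distribution should give a larger average distortion.
          for _ in range(1000):
              other = rng.dirichlet(np.ones(5))
              assert avg_kl(p_v_given_n, weights, other) >= best - 1e-9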
      <Paragraph position="23"> The Free Energy Function.  The combined minimum distortion and maximum entropy optimization is equivalent to the minimization of a single function, the free energy</Paragraph>
      <Paragraph position="24"> F = \langle D \rangle - \frac{H}{\beta} , </Paragraph>
      <Paragraph position="25"> where ⟨D⟩ is the average distortion (5) and H is the cluster membership entropy (6).</Paragraph>
      <Paragraph position="26"> The free energy determines both the distortion and the membership entropy through</Paragraph>
      <Paragraph position="27"> \langle D \rangle = \frac{\partial (\beta F)}{\partial \beta} , \qquad H = - \frac{\partial F}{\partial T} , </Paragraph>
      <Paragraph position="28"> where T = β^{-1} is the temperature.</Paragraph>
      <Paragraph position="29"> The most important property of the free energy is that its minimum determines the balance between the "disordering" maximum entropy and "ordering" distortion minimization in which the system is most likely to be found. In fact, the probability of finding the system in a given configuration is</Paragraph>
      <Paragraph position="30"> P \propto \exp(-\beta F) , </Paragraph>
      <Paragraph position="31"> so a system is most likely to be found in its minimal free energy configuration.</Paragraph>
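      A sketch of how the free energy could be tracked during the iteration sketched earlier, given the distortion matrix d(n, c) and the membership matrix p(c|n); the function name and arguments are illustrative, not code from the paper, but the formula follows (5), (6), and the definition of F above:

          import numpy as np

          def free_energy(d, p_c_given_n, beta, eps=1e-12):
              # F = <D> - H / beta, with <D> the average distortion (5)
              # and H the cluster membership entropy (6).
              avg_distortion = np.sum(p_c_given_n * d)            # sum_n sum_c p(c|n) d(n, c)
              entropy = -np.sum(p_c_given_n * np.log(p_c_given_n + eps))
              return avg_distortion - entropy / beta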
    </Section>
    <Section position="4" start_page="186" end_page="186" type="sub_section">
      <SectionTitle>
Hierarchical Clustering
</SectionTitle>
      <Paragraph position="0"> The analogy with statistical mechanics suggests a deterministic annealing procedure for clustering (Rose et al., 1990), in which the number of clusters is determined through a sequence of phase transitions by continuously increasing the parameter β following an annealing schedule.</Paragraph>
      <Paragraph position="1"> The higher β is, the more local is the influence of each noun on the definition of centroids. Distributional similarity plays here the role of distortion. When the scale parameter β is close to zero, the similarity is almost irrelevant. All words contribute about equally to each centroid, and so the lowest average distortion solution involves just one cluster whose centroid is the average of all word distributions. As β is slowly increased, a critical point is eventually reached for which the lowest F solution involves two distinct centroids. We say then that the original cluster has split into the two new clusters.</Paragraph>
      <Paragraph position="2"> In general, if we take any cluster c and a twin c' of c such that the centroid p_c' is a small random perturbation of p_c, below the critical β at which c splits, the membership and centroid reestimation procedure given by equations (8) and (11) will make p_c and p_c' converge, that is, c and c' are really the same cluster. But with β above the critical value for c, the two centroids will diverge, giving rise to two daughters of c.</Paragraph>
      <Paragraph position="3"> Our clustering procedure is thus as follows.</Paragraph>
      <Paragraph position="4"> We start with very low β and a single cluster whose centroid is the average of all noun distributions. For any given β, we have a current set of leaf clusters corresponding to the current free energy (local) minimum. To refine such a solution, we search for the lowest β which is the critical value at which some current leaf cluster splits. Ideally, there is just one split at that critical value, but for practical performance and numerical accuracy reasons we may have several splits at the new critical point. The splitting procedure can then be repeated to achieve the desired number of clusters or model cross-entropy.</Paragraph>
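      A rough sketch of this annealing loop under simplifying assumptions: splitting is detected by perturbing each centroid into a twin and keeping only the centroids that actually diverge, and the schedule, tolerances, and helper names are illustrative rather than the paper's actual settings. The KL helper and the two-stage re-estimation are repeated here so the block is self-contained:

          import numpy as np

          def kl_rows(p, q, eps=1e-12):
              # D(p_n || q_c) for every noun row against every centroid row.
              return np.sum(p[:, None, :] * np.log((p[:, None, :] + eps) / (q[None, :, :] + eps)), axis=2)

          def reestimate(p_v_given_n, p_n, centroids, beta, n_iter=30):
              # Alternate the ME membership step (8) and the ML centroid step (11).
              for _ in range(n_iter):
                  w = np.exp(-beta * kl_rows(p_v_given_n, centroids))
                  p_c_given_n = w / w.sum(axis=1, keepdims=True)
                  p_nc = p_c_given_n * p_n[:, None]
                  p_n_given_c = p_nc / p_nc.sum(axis=0, keepdims=True)
                  centroids = p_n_given_c.T @ p_v_given_n
              return p_c_given_n, centroids

          def merge_close(centroids, tol=1e-3):
              # Collapse twins that converged back to the same centroid.
              kept = []
              for c in centroids:
                  if all(np.abs(c - k).max() > tol for k in kept):
                      kept.append(c)
              return np.array(kept)

          def anneal(p_v_given_n, p_n, beta0=0.5, rate=1.2, max_clusters=8, max_steps=40, seed=0):
              rng = np.random.default_rng(seed)
              beta = beta0
              # Start with a single cluster: the average of all noun distributions.
              centroids = (p_n @ p_v_given_n)[None, :]
              for _ in range(max_steps):
                  if centroids.shape[0] >= max_clusters:
                      break
                  beta *= rate
                  # Perturb every centroid into a twin; below the critical beta the twins
                  # re-converge, above it they diverge into daughter clusters.
                  twins = centroids + 1e-3 * rng.random(centroids.shape)
                  candidates = np.vstack([centroids, twins])
                  candidates /= candidates.sum(axis=1, keepdims=True)
                  _, new_centroids = reestimate(p_v_given_n, p_n, candidates, beta)
                  centroids = merge_close(new_centroids, tol=1e-3)
              return beta, centroids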
    </Section>
  </Section>
  <Section position="5" start_page="186" end_page="188" type="metho">
    <SectionTitle>
CLUSTERING EXAMPLES
</SectionTitle>
    <Paragraph position="0"> All our experiments involve the asymmetric model described in the previous section. As explained there, our clustering procedure yields for each value of β a set C_β of clusters minimizing the free energy F, and the asymmetric model for β estimates the conditional verb distribution for a noun n by \hat{p}_n(v) = \sum_{c \in C_\beta} p(c|n) \, p_c(v) , where p(c|n) also depends on β.</Paragraph>
    <Paragraph position="1"> As a first experiment, we used our method to classify the 64 nouns appearing most frequently as heads of direct objects of the verb "fire" in one year (1988) of Associated Press newswire. In this corpus, the chosen nouns appear as direct object heads of a total of 2147 distinct verbs, so each noun is represented by a density over the 2147 verbs.</Paragraph>
    <Paragraph position="2"> Figure 1 shows the four words most similar to each cluster centroid, and the corresponding word-centroid KL distances, for the four clusters resulting from the first two cluster splits. It can be seen that the first split separates the objects corresponding to the weaponry sense of "fire" (cluster 1) from the ones corresponding to the personnel action (cluster 2). The second split then further refines the weaponry sense into a projectile sense (cluster 3) and a gun sense (cluster 4). That split is somewhat less sharp, possibly because not enough distinguishing contexts occur in the corpus.</Paragraph>
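    A sketch of how such a display could be produced from a fitted model, listing for each cluster the nouns with smallest KL distance to its centroid (helper names and the number of items shown are illustrative):

        import numpy as np

        def closest_nouns(nouns, p_v_given_n, centroids, k=4, eps=1e-12):
            # For each cluster centroid, return the k nouns with the smallest
            # KL distance D(p_n || p_c) together with those distances.
            out = []
            for c in centroids:
                d = np.sum(p_v_given_n * np.log((p_v_given_n + eps) / (c + eps)), axis=1)
                order = np.argsort(d)[:k]
                out.append([(nouns[i], float(d[i])) for i in order])
            return out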
    <Paragraph position="3"> Figure 2 shows the four closest nouns to the centroid of each of a set of hierarchical clusters derived from verb-object pairs involving the 1000 most frequent nouns in the June 1991 electronic version of Grolier's Encyclopedia (10 million words).</Paragraph>
  </Section>
</Paper>