<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0717">
  <Title>Inducing Syntactic Categories by Context Distribution Clustering</Title>
  <Section position="5" start_page="0" end_page="91" type="metho">
    <SectionTitle>
3 Context Distributions
</SectionTitle>
    <Paragraph position="0"> Whereas earlier methods all share the same basic intuition, i.e. that similar words occur in similar contexts, I formalise this in a slightly different way: each word defines a probability distribution over all contexts, namely the probability of the context given the word. If the context is restricted to the word on either side, I can define the context distribution to be a distribution over all ordered pairs of words: the word before and the word after. The context distribution of a word can be estimated from the observed contexts in a corpus. We can then measure the similarity of words by the similarity of their context distributions, using the Kullback-Leibler (KL) divergence as a distance function.</Paragraph>
    <Paragraph position="1"> Unfortunately it is not possible to cluster based directly on the context distributions for two reasons: first the data is too sparse to estimate the context distributions adequately for any but the most frequent words, and secondly some words which intuitively are very similar  (Schfitze's example is 'a' and 'an') have radically different context distributions. Both of these problems can be overcome in the normal way by using clusters: approximate the context distribution as being a probability distribution over ordered pairs of clusters multiplied by the conditional distributions of the words given the</Paragraph>
    <Paragraph position="3"> I use an iterative algorithm, starting with a trivial clustering, with each of the K clusters filled with the kth most frequent word in the corpus. At each iteration, I calculate the context distribution of each cluster, which is the weighted average of the context distributions of each word in the cluster. The distribution is calculated with respect to the K current clusters and a further ground cluster of all unclassified words: each distribution therefore has (K + 1) 2 parameters. For every word that occurs more than 50 times in the corpus, I calculate the context distribution, and then find the cluster with the lowest KL divergence from that distribution.</Paragraph>
    <Paragraph position="4"> I then sort the words by the divergence from the cluster that is closest to them, and select the best as being the members of the cluster for the next iteration. This is repeated, gradually increasing the number of words included at each iteration, until a high enough proportion has been clustered, for example 80%. After each iteration, if the distance between two clusters falls below a threshhold value, the clusters are merged, and a new cluster is formed from the most frequent unclustered word. Since there will be zeroes in the context distributions, they are smoothed using Good-Turing smoothing(Good, 1953) to avoid singularities in the KL divergence. At this point we have a preliminary clustering - no very rare words will be included, and some common words will also not be assigned, because they are ambiguous or have idiosyncratic distributional properties.</Paragraph>
  </Section>
  <Section position="6" start_page="91" end_page="91" type="metho">
    <SectionTitle>
4 Ambiguity and Sparseness
</SectionTitle>
    <Paragraph position="0"> Ambiguity can be handled naturally within this framework. The context distribution p(W) of a particular ambiguous word w can be modelled as a linear combination of the context distributions of the various clusters. We can find the mixing coefficients by minimising D(p(W)ll (w) a~w) oLi qi) where the are some coefficients that sum to unity and the qi are the context distributions of the clusters. A minimum of this function can be found using the EM algorithm(Dempster et al., 1977). There are often several local minima - in practice this does not seem to be a major problem.</Paragraph>
    <Paragraph position="1"> Note that with rare words, the KL divergence reduces to the log likelihood of the word's context distribution plus a constant factor. However, the observed context distributions of rare words may be insufficient to make a definite determination of its cluster membership. In this case, under the assumption that the word is unambiguous, which is only valid for comparatively rare words, we can use Bayes's rule to calculate the posterior probability that it is in each class, using as a prior probability the distribution of rare words in each class. This incorporates the fact that rare words are much more likely to be adjectives or nouns than, for example, pronouns.</Paragraph>
  </Section>
class="xml-element"></Paper>