In: Proceedings of CoNLL-2000 and LLL-2000, pages 91-94, Lisbon, Portugal, 2000. 
Inducing Syntactic Categories by Context Distribution Clustering 
Alexander Clark 
School of Cognitive and Computing Sciences 
University of Sussex 
alexc¢cogs, susx. ac. uk 
Abstract 
This paper addresses the issue of the automatic 
induction of syntactic categories from unanno- 
tared corpora. Previous techniques give good 
results, but fail to cope well with ambiguity or 
rare words. An algorithm, context distribution 
clustering (CDC), is presented which can be 
naturally extended to handle these problems. 
1 Introduction 
In this paper I present a novel program that in- 
duces syntactic categories from comparatively 
small corpora of unlabelled text, using only dis- 
tributional information. There are various mo- 
tivations for this task, which affect the algo- 
rithms employed. Many NLP systems use a 
set of tags, largely syntactic in motivation, that 
have been selected according to various criteria. 
In many circumstances it would be desirable for 
engineering reasons to generate a larger set of 
tags, or a set of domain-specific tags for a par- 
ticular corpus. Furthermore, the construction 
of cognitive models of  acquisition- 
that will almost certainly involve some notion 
of syntactic category - requires an explanation 
of the acquisition of that set of syntactic cate- 
gories. The amount of data used in this study 
is 12 million words, which is consistent with a 
pessimistic lower bound on the linguistic experi- 
ence of the infant  learner in the period 
from 2 to 5 years of age, and has had capitalisa- 
tion removed as being information not available 
in that circumstance. 
2 Previous Work 
Previous work falls into two categories. A num- 
ber of researchers have obtained good results 
using pattern recognition techniques. Finch 
and Chater (1992), (1995) and Schfitze (1993), 
(1997) use a set of features derived from the 
co-occurrence statistics of common words to- 
gether with standard clustering and information 
extraction techniques. For sufficiently frequent 
words this method produces satisfactory results. 
Brown et al. (1992) use a very large amount 
of data, and a well-founded information theo- 
retic model to induce large numbers of plausi- 
ble semantic and syntactic clusters. Both ap- 
proaches have two flaws: they cannot deal well 
with ambiguity, though Schfitze addresses this 
issue partially, and they do not cope well with 
rare words. Since rare and ambiguous words are 
very common in natural , these limita- 
tions are serious. 
3 Context Distributions 
Whereas earlier methods all share the same ba- 
sic intuition, i.e. that similar words occur in 
similar contexts, I formalise this in a slightly 
different way: each word defines a probability 
distribution over all contexts, namely the prob- 
ability of the context given the word. If the 
context is restricted to the word on either side, 
I can define the context distribution to be a dis- 
tribution over all ordered pairs of words: the 
word before and the word after. The context 
distribution of a word can be estimated from 
the observed contexts in a corpus. We can then 
measure the similarity of words by the simi- 
larity of their context distributions, using the 
Kullback-Leibler (KL) divergence as a distance 
function. 
Unfortunately it is not possible to cluster 
based directly on the context distributions for 
two reasons: first the data is too sparse to es- 
timate the context distributions adequately for 
any but the most frequent words, and secondly 
some words which intuitively are very similar 
91 
(Schfitze's example is 'a' and 'an') have rad- 
ically different context distributions. Both of 
these problems can be overcome in the normal 
way by using clusters: approximate the context 
distribution as being a probability distribution 
over ordered pairs of clusters multiplied by the 
conditional distributions of the words given the 
clusters : 
p(< Wl, W2 >) -= p(< Cl, C2 >)p(wlICl)p(w2\[c2) 
I use an iterative algorithm, starting with a 
trivial clustering, with each of the K clusters 
filled with the kth most frequent word in the 
corpus. At each iteration, I calculate the con- 
text distribution of each cluster, which is the 
weighted average of the context distributions of 
each word in the cluster. The distribution is cal- 
culated with respect to the K current clusters 
and a further ground cluster of all unclassified 
words: each distribution therefore has (K + 1) 2 
parameters. For every word that occurs more 
than 50 times in the corpus, I calculate the con- 
text distribution, and then find the cluster with 
the lowest KL divergence from that distribution. 
I then sort the words by the divergence from 
the cluster that is closest to them, and select 
the best as being the members of the cluster 
for the next iteration. This is repeated, grad- 
ually increasing the number of words included 
at each iteration, until a high enough propor- 
tion has been clustered, for example 80%. Af- 
ter each iteration, if the distance between two 
clusters falls below a threshhold value, the clus- 
ters are merged, and a new cluster is formed 
from the most frequent unclustered word. Since 
there will be zeroes in the context distributions, 
they are smoothed using Good-Turing smooth- 
ing(Good, 1953) to avoid singularities in the KL 
divergence. At this point we have a preliminary 
clustering - no very rare words will be included, 
and some common words will also not be as- 
signed, because they are ambiguous or have id- 
iosyncratic distributional properties. 
4 Ambiguity and Sparseness 
Ambiguity can be handled naturally within 
this framework. The context distribution p(W) 
of a particular ambiguous word w can be 
modelled as a linear combination of the con- 
text distributions of the various clusters. We 
can find the mixing coefficients by minimising 
D(p(W)ll (w) a~w) oLi qi) where the are some co- 
efficients that sum to unity and the qi are the 
context distributions of the clusters. A mini- 
mum of this function can be found using the 
EM algorithm(Dempster et al., 1977). There 
are often several local minima - in practice this 
does not seem to be a major problem. 
Note that with rare words, the KL divergence 
reduces to the log likelihood of the word's con- 
text distribution plus a constant factor. How- 
ever, the observed context distributions of rare 
words may be insufficient to make a definite de- 
termination of its cluster membership. In this 
case, under the assumption that the word is 
unambiguous, which is only valid for compar- 
atively rare words, we can use Bayes's rule to 
calculate the posterior probability that it is in 
each class, using as a prior probability the dis- 
tribution of rare words in each class. This in- 
corporates the fact that rare words are much 
more likely to be adjectives or nouns than, for 
example, pronouns. 
5 Results 
I used 12 million words of the British Na- 
tional Corpus as training data, and ran this al- 
gorithm with various numbers of clusters (77, 
100 and 150). All of the results in this paper 
are produced with 77 clusters corresponding to 
the number of tags in the CLAWS tagset used 
to tag the BNC, plus a distinguished sentence 
boundary token. In each case, the clusters in- 
duced contained accurate classes corresponding 
to the major syntactic categories, and various 
subgroups of them such as prepositional verbs, 
first names, last names and so on. Appendix A 
shows the five most frequent words in a cluster- 
ing with 77 clusters. In general, as can be seen, 
the clusters correspond to traditional syntactic 
classes. There are a few errors - notably, the 
right bracket is classified with adverbial parti- 
cles like "UP". 
For each word w, I then calculated the opti- 
mal coefficents c~ w). Table 1 shows some sam- 
ple ambiguous words, together with the clusters 
with largest values of c~ i. Each cluster is repre- 
sented by the most frequent member of the clus- 
ter. Note that "US" is a proper noun cluster. 
As there is more than one common noun clus- 
ter, for many unambiguous nouns the optimum 
is a mixture of the various classes. 
92 
Word Clusters 
ROSE 
VAN 
MAY 
US 
HER 
THIS 
CAME CHARLES GROUP 
JOHN TIME GROUP 
WILL US JOHN 
YOU US NEW 
THE YOU 
THE IT LAST 
Table 1: Ambiguous words. For each word, the 
clusters that have the highest a are shown, if 
a > 0.01. 
Model 
Freq 
1 0.66 0.21 
2 0.64 0.27 
3 0.68 0.36 
5 0.69 0.40 
10 0.72 0.50 
20 0.73 0.61 
CDC Brown CDC Brown 
NN1 NN1 A J0 A J0 
0.77 0.41 
0.77 0.58 
0.82 0.73 
0.83 0.81 
0.92 0.94 
0.91 0.94 
Table 2: Accuracy of classification of rare words 
with tags NN1 (common noun) and A J0 (adjec- 
tive). 
Table 2 shows the accuracy of cluster assign- 
ment for rare words. For two CLAWS tags, A J0 
(adjective) and NNl(singular common noun) 
that occur frequently among rare words in the 
corpus, I selected all of the words that oc- 
curred n times in the corpus, and at least half 
the time had that CLAWS tag. I then tested 
the accuracy of my assignment algorithm by 
marking it as correct if it assigned the word 
to a 'plausible' cluster - for A J0, either of the 
clusters "NEW" or "IMPORTANT", and for 
NN1, one of the clusters "TIME", "PEOPLE", 
"WORLD", "GROUP" or "FACT". I did this 
for n in {1, 2, 3, 5, 10, 20}. I proceeded similarly 
for the Brown clustering algorithm, selecting 
two clusters for NN1 and four for A J0. This can 
only be approximate, since the choice of accept- 
able clusters is rather arbitrary, and the BNC 
tags are not perfectly accurate, but the results 
are quite clear; for words that occur 5 times or 
less the CDC algorithm is clearly more accurate. 
Evaluation is in general difficult with unsu- 
pervised learning algorithms. Previous authors 
have relied on both informal evaluations of the 
plausibility of the classes produced, and more 
formal statistical methods. Comparison against 
existing tag-sets is not meaningful - one set of 
Test set 1 2 3 4 
CLAWS 411 301 478 413 
Brown et al. 380 252 444 369 
CDC 372 255 427 354 
Mean 
395 
354 
346 
Table 3: Perplexities of class tri-gram models 
on 4 test sets of 100,000 words, together with 
geometric mean. 
tags chosen by linguists would score very badly 
against another without this implying any fault 
as there is no 'gold standard'. I therefore chose 
to use an objective statistical measure, the per- 
plexity of a very simple finite state model, to 
compare the tags generated with this cluster- 
ing technique against the BNC tags, which uses 
the CLAWS-4 tag set (Leech et al., 1994) which 
had 76 tags. I tagged 12 million words of BNC 
text with the 77 tags, assigning each word to 
the cluster with the highest a posteriori proba- 
bility given its prior cluster distribution and its 
context. 
I then trained 2nd-order Markov models 
(equivalently class trigram models) on the orig- 
inal BNC tags, on the outputs from my algo- 
rithm (CDC), and for comparision on the out- 
put from the Brown algorithm. The perplexities 
on held-out data are shown in table 3. As can 
be seen, the perplexity is lower with the model 
trained on data tagged with the new algorithm. 
This does not imply that the new tagset is bet- 
ter; it merely shows that it is capturing statisti- 
cal significant generalisations. In absolute terms 
the perplexities are rather high; I deliberately 
chose a rather crude model without backing off 
and only the minimum amount of smoothing, 
which I felt might sharpen the contrast. 
6 Conclusion 
The work of Chater and Finch can be seen as 
similar to the work presented here given an in- 
dependence assumption. We can model the con- 
text distribution as being the product of inde- 
pendent distributions for each relative position; 
in this case the KL divergence is the sum of 
the divergences for each independent distribu- 
tion. This independence assumption is most 
clearly false when the word is ambiguous; this 
perhaps explains the poor performance of these 
algorithms with ambiguous words. The new 
algorithm currently does not use information 
93 
about the orthography of the word, an impor- 
tant source of information. In future work, I will 
integrate this with a morphology-learning pro- 
gram. I am currently applying this approach 
to the induction of phrase structure rules, and 
preliminary experiments have shown encourag- 
ing results. 
In summary, the new method avoids the limi- 
tations of other approaches, and is better suited 
to integration into a complete unsupervised lan- 
guage acquisition system. 

References 
Peter F. Brown, Vincent J. Della Pietra, Peter V. 
de Souza, Jenifer C. Lai, and Robert Mercer. 
1992. Class-based n-gram models of natural lan- 
guage. Computational Linguistics, 18:467-479. 
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. 
Maximum likelihood from incomplete data via the 
EM algorithm. Journal o/ the Royal Statistical 
Society Series B, 39:1-38. 
S. Finch and N. Chater. 1992. Bootstrapping syn- 
tactic categories. In Proceedings o/ the l~th An- 
nual Meeting of the Cognitive Science Society, 
pages 820-825. 
S. Finch, N. Chater, and Redington M. 1995. Ac- 
quiring syntactic information from distributional 
statistics. In Joseph P. Levy, Dimitrios Bairak- 
taris, John A. Bullinaria, and Paul Cairns, edi- 
tors, Connectionist Models o/Memory and Lan- 
guage. UCL Press. 
I. J. Good. 1953. The population frequencies of 
species and the estimation of population parame- 
ters. Biometrika, 40:237-264. 
G. Leech, R. Garside, and M Bryant. 1994. 
CLAWS4: the tagging of the British National 
Corpus. In Proceedings o/the 15th International 
Con/erence on Computational Linguistics, pages 
622-628. 
Hinrich Schfitze. 1993. Part of speech induction 
from scratch. In Proceedings o/ the 31st an- 
nual meeting o/ the Association /or Computa- 
tional Linguistics, pages 251-258. 
Hinrich Schfitze. 1997. Ambiguity Resolution in 
Language Learning. CSLI Publications. 
