Hierarchical Clustering of Words 
Akira Ushioda* 
ATR h:lt(;rth'e.1;h:lg T(;l('(:omnnlni(;&l;ions Res('ax(:.h I,M)ort~tories 
2-2 ltika,ridM, S(;ika-(:ho, SorMcu-gun, Kyoto, ,\]a,l:)~m 6\] 9-02 
ema,i\]: ushioda(a)itl, air. co. jp 
Abstract 
This plq)er (lescril)es a (hit i~-(triven 
nlet, hod for hiera, rchicM chlstering of 
words ill whicii a, la, rge vo(:aJ)ul~ry of I,;ii. 
glis\]'l words is (:histered botl;oln--uf) > with 
resl)e(:t 1,o (:orpor;~ ranghig in size fi'otn 
5 to 50 nlillion wor(ts, using a greedy al 
gorithm that I;ries I,o nliniluize i~veri~ge 
lOS8 Of liCllltllal iriforuu:l,l, ion of a, djax:ent 
classes. The resulting hierar('.hi('al (:illS- 
tiers of woMs are then tumirMly 1,rans- 
rorlned to a bit-string representld, ion of 
(i.e. word bits for) all the words ill the vo- 
cabulary, Introducing wor(l bits hito i.he 
ATI{ I)ecision-Tree DOS Tagger is shown 
to signific~mt,ly reduce l, he ti~gging error 
rld;e. PortM)ility of word t)il.s h:om Olle 
(tonlMn to i~Hotilel: iS ~tlSO diss(:ussed. 
1 Introduction 
()lie of bile fulida, rlrient~J issues concernhlg corpus-- 
l)ased NI,P is t;he (tmLa 8I)a, rsetless prot)len'l. In 
view of the eft'e(',tiveliess of class-ha,seal ll-gl'a, lll \]<%ll-- 
gllage nlodels i~gMnst Lhe (\]~ta s\]7)i~l'Seliess i)rol)lenl 
(Kneser iLli(l Ney 1993), it; is expected t;l-li~t classes 
of words are Mso usefiil for NI,P tasks ill such a 
wi~y that statistics oil (:\]~sses ;tre used whenever 
stal;istics oil individua, l words il, i'e una,vaihdlle or 
unreli&i)le. All ide, al type of clusi, ers for N I,P is 
the ()lie which gu;tra, rltees in ut iia\[ substitu I;M)ilit, y, 
ill tern'is ()f t)oth synl;a,ctic a, ud seltilultic SOUll(l- 
lleSs, &lnOllg words in the sa, rtle class. 
Furthermore, chlstering is nnl(:h more iiseful if 
the clusl;ers i~i'e of vnriMJe grmnllarity, or hierar-- 
chi('al. We will consider i~ tree represent~tl, ion of 
MI the words in t,he vocM)uh~ry in which the root; 
node l:ei)resenl;s the whole vo(:i~l)uli~l'y i~lltl ~ le~f 
llOde rel)rese\[lt;S a, word ill the voclJ)llli~ry. Also, 
~.uiy set of nodes in I;ile tree constil, utes ~ i)m:t,i- 
tion (or cluslx)ring) of the vo(:M)ulary if t;here ex- 
ists one i%ll(I only Olle llode iu i, lle seL ,-%lollg the 
p{~th from the root node ix) ei~(:}l leiff node, In the 
following sectk)n<% we will describe i~ nletliod Or 
crea, th'lg bim~ry tree represeuti~l;ion ()f wor(|s a, ud 
present restllts of ev;tlua,tiilg a, nd conll)aring the 
qualii;y of i;he hierarchi(:M clusters ot)tMne(I fronl 
texl, s ()r W.q:y dilTerent sizes. 
*Calrrellt a.(hlress: Me(lbt lutegr~ttion I,al}or;ttory, 
Fujitsu L;tboral, ories I,td., Ka.w~tsMd, .\]~tpa.n. gtil~til: 
ushiod a(~gfl~d).fiiji tsu.(:o.j p. 
2 Word Bits Construction 
Our word bits coiJstruction <~lgorMlm is ;~ lno(ti- 
flotation mid mi extension of the mutual infornm. 
l, ion chistering Mgorithm proposed })y l}rown et 
ill, (1992). We will first illustrate the dilTereltce 
between file original rormuh~e iul(t the oues we 
used, lind theft introduce the word bits co,.st.ruc- 
tion Mgorithni. We will use the same no(.aA;ion ;ks 
ill Ilrown et M. to tm.Lke the conll);trison e;~sier. 
2.1 Mutual Infi)rmatlon Clust(;rhig 
Algorithm 
MutuM information chlstering niethod enlploys a 
t)ottuni-up merging t)roce,(hire with the ~wel'i%ge 
lllUl.ll&\] illfOrlrllttioll (AMI) or ;sit,cent. classes in 
the text ms an o\[)jective hmction. In the iltitial 
sta, ge, er~(:h word in the vocM)ul<'u'ly of size V is 
i~ssig.,e(l to il;s own (listii,(:t class. We then inerge 
(,wo cla.sses if die merging of {;hem induces \[.i,, 
immn AMI reduction arllong all pMrs of classes, 
ttll(\] we rei)e~d; the nlergitlg 8(,e f) until {,tie Numl)er 
of the (:lasses is reduced to the pre(leliiied nuni- 
t)er C,. Time colitplexity or this basic algorithnl is 
O(V 5) when iinph;rueui,ed sl, rMglitforwardly, l}y 
storing the resu\]\[, of all the trim nierges ~(, (,lie pre- 
vious inerging step, however, the tinie coniplexity 
call be reduced to O(V 3) ;~s shown t)elow. 
Suppos(; dia.t, stm'thig with V ch~sses, we have 
Mre;uiy made V - k nlerges, lelwing k (:lasses, 
(.:~.(J), (:~(2), .. , c.'~(,:)_ The AMI i~t, ~his sti~ge 
is given by the rollowillg e(llmtions. 
I~ : ~ q~(<',,,,) (J) 
q~(l, m) pk(l,',.)log p~:(l,.,) (:~) pl~(l)pr~ (m) 
where p/~(l,m) is the probM)ility that a woM iN 
(,'~(1) is followed l)y i~ word ill C~(m), mid 
pl~(l) : ~_f~l)~(l,,n), pr,:(m)-= ~_~l,~:(l,,n). 
tr~ 
In equ;-~tion \[, qh's ;~re sunlrHed over l, he entire k X 
k (:lass bigrlun table ill which (l,lu) cell rel)reseilts 
qx+,(f,m), hi this irlerging step we invesi;igate a 
trial merge or c'~(i) mM (:~(j) tbr MI (:h~ss pairs 
(i,j), lind con,pure the AMI reduction L~(i,j) 
I~: - l~:(i,j) efre(:i~ed by this .~erge, where l~:(i,j) 
is tile AMI aft;or the lilertre,. 
Suppose that the I);dr (Cx:(i),C);(j)) was cho 
sen to merge, thai. is, l,~(i,j) ~ L~,(l,m) for M1 
1\].59 
pairs (l,m). In the next merging st.el) , we have 
L~;'J)tl m) for all the pairs (l,m). to cMculate ~-1~ , 
ltere we use the superscript (i, j) to indicate that 
(Ck (i), Ck (j)) w as merged in the previous merging 
step. 
Now note that the difference be- 
tween L (i'j)(l m) and L~(l,m) only comes fronl k-1 ~" 
the terms which are affected by mergiug the pair 
(C~(i),C~:(j)). 
Since L~.(l, rn) = I~-1~(I, m) and L"'J)(l m) = k,-l~ , 
- we have 
-- , -- ( l(i'J) -- lk ), m)) + 
Some part of the summation region of I~'j~)(l, ,n) 
and I~ cancels out with a part of l~i;~ ) or a part 
of a(t,.,). Let "0, i (t .0, i and 
i~ denote the values of l~iLJ)(l, rn),lt:(l,m),l~i'_J 1) 
and I~, respectively, after all the common terms 
among theln which Call be canceled are canceled 
out. Then, we have 
L(i'J)tl m)-- Lk(l,m) = k-lk , 
where 
\[(i'J)ll m) 
k-1 ~ ' = q~-l(l+ rn,i) + qk-l(i,l+ m) 
= q~(l+m,i)+q~(i,l+m)+ 
q~(l + re, j) + q~(j,l + m) 
= q~_l(i,l) + q~_l(i,m) + 
qk-l(l, i) + qk-l(m, i) 
= qx:(i,l) + q~(i,m) + q~(j,l) + 
q~(j, rn) + qk(l,i) + qk(l,j) + 
q~(m, i) + qk(rn,j) 
Because equation 3 is expressed as the summation 
of a fixed number of q's, its value can be cMculated 
in constant time, whereas the cMculation of e(tua- 
tion 1 requires O(V 2) time. Therefore, the total 
time complexity is reduced by O(V~). 
The summation regions of I's in equation 3 are 
illustrated in Figure 1. Brown et al. seem to have 
ignored the second term of the right hand side of 
equation 3 and used only the first term to calculate 
L~i,J~(l,m)_Lk(l,m) 1. However, since thesecoud 
term has as much weight as the first terln, we used 
equation 3 to mgke the model complete. 
Even with the O(V a) algorithm, the calculation 
is not practical for a large vocabulary of order 10 4 
or higher. Brown et al. proposed the following 
1A(:tually, it is the first term of equation 3 times 
(-l) that appeared in their paper, but we believe that 
it is simply due to a misprint. 
i j 1 m 
J :::::::::::::::::::::::::::::: 
, 
i 
J 
l+n 
i j l+m 
:::::::::::::::::::::::::::::: 
:::::::::::::::::::::::::::::: il il 
:::::::::::::::::::::::::::: 
\[i !! \[i 
ik ik(Lm) 
========================== .... 
::::::::::: 
(~) l+m k-I 
1 :::::::::::::::::::::::::::::: 
k-it ii 
i~,j!l ~'(i,j) :l In _. Ik_ D, ~ ) 
Figure 1: Summation Regions for l's 
Merging History: Dendrogram 
Merge(A, B -> A) :=> 
Merge(C, D -> c) __~__ 
Merge(A, C -> A) 
Merge(X,Y->Z) reads (A "merge X and Y and name 
the new class as Z" 
Figure 2: I)endrogram Construction 
method, which we also adopted. We first make V 
singleton classes out of the V words in the vocab- 
ulary and arrange the (:lasses in descending order 
of frequency, then define the merging region as the 
first C+ l positions in the sequence of classes. At 
each merging step, merging of only the (:lasses in 
the merging region is considered, thus reducing 
the number of trial merges from O(V 2) to O(C'~). 
After each actual merge, the most frequent single- 
ton class outside of the merging region is shifted 
into the region. With this algorithm, the time 
complexity is reduced to O(C ~ V). 
2.2 Word Bits Construction Algorithm 
The simplest way to construct a tree structured 
representation of words is to construct a dendro- 
gram from the record of the merging order. A sim- 
ple example with a five-word vocabulary is shown 
in Figure 2. If we apply this method to the above 
O(C'2V) algorithm, however, we obtain for each 
class an extremely unbalanced, Mmost left branch- 
ing subtree. The reason is that after classes in the 
merging region are grown to a certain size, it is 
much less expensive, in terms of AM1, to merge a 
singleton class with lower frequency into a higher 
frequency class than merging two higher frequency 
classes with substantiM sizes. 
A new approach we adopted is as follows. 
1. Ml-clustering: Make C classes using the mutual 
information clustering algorithm with the merging 
1160 
l 
"ORJ 
Figure 3: Sanlple Sub(.ree for One (:lass 
region constraint mentioned in (2.1). 
2. Outer-clustering: Replace all words in the text 
with their class token 2 and execute binary merg- 
ing without l;lle merging region constraint until all 
the classes are merged into a singe (:lass. Make a 
dendrogram out of this process. This dendrograrn, 
1),.oo¢, constitutes the upper part of the final tree. 
a. l,),~.,.-,:~,,s~.,.i,,,j: Let {C(I), C(2),..., c(c)} ))e 
the set of the classes obtained at, step l. l,'or each 
i (1 < i < C) do the following. 
(3) Replace all words in the text except those in 
C(i) with their <:lass token, l)efine a new vocabu-- 
lary V' = V1 U V> where V1 = {all the words in 
(\](i)}, V 2 = {C'l,(\]2,...,C.,i_l,C'i+l,C,c} , and 65 
is a token for (:(j) for I < j < C. Assign each 
element in V' to its own class and execute binm:y 
merging with a merging constraint such (,ha(, only 
those classes which only contaiu elements of Vl 
can be merged. 
(b) Repeat merging until all the eletuents in VI 
are i)ut in a single (:lass. 
Make a dendrogrmn l).~,d~ out of the merging pro- 
tess for each class. This (teudrogram coust, itutes a 
subtree for each (:lass with a leaf node rel)resent- 
ing each word in the class. 
4. Combine tile dendrograms by substituting each 
leaf node of l)root with coresponding l),,Lb 
This algorithm produces a b,~lanced binary tree 
represent;ation of words in which (,hose words 
which are close in meaning or syntactic feature 
come close in posit, ion. Figure 3 shows an exam- 
pie of l).,~b for orle class out of 500 (:lasses con- 
structed using this algorithm wit|) a vocabulary 
of the 70,000 most; frequently occm:ring words in 
the Wall Street; Journal Corpus. Finally, by trac- 
ing the path from the root node to a leaf node aud 
assigning a bit to each bra, uch with zero or one rep- 
resenting a left or right branc\]b respectively, we 
car, assign a bit-string (word bits) to each word in 
the vocabulary. 
~\]n the actuM implement~ttion, we only htwe to 
work on the bigr~ml t*Lble instead of tim whole text. 
Event- 128: 
{( wo,'cl(O),"like" } { wore\]{-1). "flies" } ( WOl'd(-2),"ti,nt," } 
W()l'(l(l),ll\[l|/" } ( .... '(|(~), u{t\['I'()VVu } 
tag(d),"Ve|.b-ard-Sg-lype3"} ( tag( 2),"Nou,,-Sg typeld" ) 
........ (Basic Questions) 
Inclass?(word(0), Clasu295), "yes" } 
Wo|'dBits(~cVord(-1), 29), "1" } 
........ ( V~rOl'Cl Bi, s Quesl;ions) 
IsPrefix?(Word(0), "anl;i"), "no" } 
........ (Linguist's Qt|estions ) 
Tag, "Prep-type5" } } 
Figure 4: Examt)le of a.n event 
3 Experiments 
We used phdu texts from six years of tile WSJ 
C, ort)us to create word bits. The sizes of tile texts 
are 5 million words (MW), t0MW, 20MW, and 
50M W. '|'he vocabulary is selected as the 70,000 
most; fl:eqneutly occurring words in the entire co> 
pus. We set the number C of <:lasses to 500. 
The obtained hierarchical clusters are ewdua.ted 
via the error rate of the ATI{ l)ecision-Tree Part-- 
Of-Speech Tagger which is based on SPAT'\['I,;t{ 
(Magerman 199,1). The tagger employs a set of 
443 syntactic tags. In the training phase, a set of 
events are extracted from the training texts. An 
event is a set of feature-value pairs or question- 
answer pairs. A feature can be any attribute of 
the context in which the current word word(O) ap- 
pears; it is conveniently expressed as a question. 
Figure 4 shows an example of an evetlt, with a cur- 
rent word "like". The last \[)air in the event is a 
special item which shows the answer, i.e., the col 
rect tag of the current word. The first three lines 
show questions about identity of words around tile 
current word and tags for previous words. These 
questions are cMled basic que.slio~,s and always 
used. The second type of questions, word bits 
questions, are on clusters and word bits such as 
what is the 29th bit of the previous word's word 
bits?. The third type of questkms are cMled lin- 
gui.sl's questiona and these are compiled by an ex- 
pert grmlmmrian. 
Out of the set of events, a decision tree is 
constructed whose leaf nodes contain conditionM 
probability distributi(ms of tags, conditioned by 
the feature values. In tile test phase the system 
looks up conditionM probability distributions of 
tags R)r eat:l, word in the test text and chooses the 
most probable tag sequences using beam search. 
We used WSJ texts and the ATI{ cor\[ms (lllack 
et al. 1996) for the tagging experiment. Both col 
pora use the ATR syntactic tag set. Since the 
ATR corpus is still in the process of development, 
the size of the texts we have at hand for this ex- 
periment is rather ndnimal considering tim large 
size of the tag set. Table 1 shows the sizes of texts 
used for the experiment;. Figure 5 shows the t;ag- 
ging error rat;es plotted against various clustering 
1161 
~/ Text: wsJ Text 
261 • WordBits 0 Lh~llQ~t & WordBits 
24I\ Text: ATR Corp: 
20~ Reshuffled (WSJ Text) 
= \]\ '~ E1 WordBits 
16 ...... "~ .......... 
10 \[ d5 xS 
10 20 30 410 ~0 
Clustering Text Size (Million Words) 
l,'igure 5: Tagging Error Rate 
60 
Text Size (words) Training q'est Ileht-Out 
WSJ Text 75,139 5,831 6,534 
"ATR Text 76,132 23,163 6,68"0 
Table 1: Texts for Tagging Experiments 
text sizes. Out of the three types of questions, ba- 
sic questions and word bits questions are always 
used in this ext)eriment. 'lb see the effect of in- 
troducing word bits information into the tagger, 
we performed a separate experiment in which a 
randomly generated bit-string is assigned to each 
word 3 and basic questions and word bits questions 
are used. The results are plotted at zero clustering 
text size. For both WSJ texts and ATR corpus, 
the tagging error rate dropped by more than 30% 
when using word bits information extracted from 
the 5MW text, and increasing the clustering text 
size further decreases the error rate. At 50MW, 
the error rate drops by 43%. This shows the ira: 
provement of the quality of the hierarchical clus- 
ters with increasing size of the clustering text. In 
Figure 5, introduction of linguistic questions 4 is 
also shown to significantly reduce the error rates 
for the WSJ corpus. The dependency of the er- 
ror rates on the clustering text size is quite sin> 
liar to the ea.se in which no linguistic questions 
are used, indicating the effectiveness of combin- 
3Since a distin<:tive bit-string is assigned to each 
word, the tagger also uses a bit-string as an ID number 
for each word in the process, In this control experi- 
ment bit-strings are assigned in a random way, but no 
two words are assigned the same word lilts. Random 
word bits are expected to give no class information to 
the tagger except for the identity of words. 
4The linguistic questions we used her(.' are still in 
the initial stage of development and are by no means 
comprelmnsive. 
ing automatically created word bits and hand- 
crafted linguistic questions. Figure 5 also shows 
that reshuming the classes several times just after 
step I (MLclustering) of the word bits construc- 
tion process filrther improves the word bits. One 
round of reshuffling corresponds to moving each 
word in the vocabulary from its original (:lass to 
another class whenever the movement increases 
the AMI, starting from the most frequent word 
through the least frequent one. The figure shows 
the error rates with zero, two, and five rounds 
of reshufi\]ing 5. Overall high error rates are at- 
tributed to the very large tag set; and the small 
training set. Another notable point in the figure is 
that introducing word bits constructed from WSJ 
texts is as effective for tagging Aq'R text.s as it is 
for tagging WSJ texts even though these texts are 
from very different domains. To \[;hat extent, the 
obtained hierarchical clusters are considered to be 
portable across domains. 
4 Conclusion 
We presented an algorithm for hierarchical <:has: 
tering of words, and conducted a clustering exper- 
iment using large texts of:varying sizes. High qtml- 
ity of the obtained clusters are confirmed by the 
POS tagging experiments. By introducing word 
bits into the ATR l)ecision-Tree POS Tagger, the 
tagging error rate is reduced by up to 43%. The 
hierarchical clusters obtained fi'orn WSJ texts are 
also shown to be usefld \['or tagging ATR texts 
which are fi'om quite different domMns than WSJ 
texts. 
Acknowledgements 
We thank John Lafferty for his helpful suggestions. 
References 
Black, E., Eubank, S., Kashioka, H., Magerman, D., 
Garside, R., and Leech, G. (1996) "Beyond Skelc- 
ton Parsing: Producing a Comprehensive Large-Scale 
General-English Treebank With Full Grammatical 
Analysis". Proceedings of the 16th International C'on- 
ference on Computational Linguistics. 
Brown, P., Della Pietra, V., deSouza, P., Lai, J., Mer- 
cer, R. (1992) "Class-Based n-gram Models of Natural 
Language". Computational Linguistics, Vol. 18, No 4, 
pp. 467 479. 
Kneser, R. and Ney, H. (71993) "hnproved Clustering 
Techniques for C, lass-Bascd Statistical Language Mod- 
elling". Proceedings <>f European C, onfl'n'cnce on Spee<:h 
Communication and Technology. 
• Magerman, D. (1994) Nabural Language Parsing as 
Statistical Pattern Recognition. Doctoral dissertation. 
Stanford University, Stanford, California. 
~'The vocabulary used for tile reshuffling experi- 
ments is the one used for a preliminary e.xperimcnt 
and its size is 63850. 
1162 
