Clustering Words with the MDL Principle 
Hang Li and Naoki Abe 
Theory NEC Laboratory, RWCP* 
c/o C&C Research Laboratories, NEC 
4-1-1 Miyazaki Miyamae-ku, Kawasaki, 216 Japan 
{lihang,abe}@sbl.cl.nec.co.jp 
Abstract 
We address the problem of automatically constructing a thesaurus by
clustering words based on corpus data. We view this problem as that of
estimating a joint distribution over the Cartesian product of a partition
of a set of nouns and a partition of a set of verbs, and propose a learning
algorithm based on the Minimum Description Length (MDL) Principle for such
estimation. We empirically compared the performance of our method based on
the MDL Principle against the Maximum Likelihood Estimator in word
clustering, and found that the former outperforms the latter. We also
evaluated the method by conducting pp-attachment disambiguation experiments
using an automatically constructed thesaurus. Our experimental results
indicate that such a thesaurus can be used to improve accuracy in
disambiguation. 
1 Introduction 
Recently various methods for automatically constructing a thesaurus
(hierarchically clustering words) based on corpus data have been proposed
(Hindle, 1990; Brown et al., 1992; Pereira et al., 1993; Tokunaga et al.,
1995). The realization of such an automatic construction method would make
it possible to a) save the cost of constructing a thesaurus by hand, b) do
away with subjectivity inherent in a hand made thesaurus, and c) make it
easier to adapt a natural language processing system to a new domain. In
this paper, we propose a new method for automatic construction of thesauri.
Specifically, we view the problem of automatically clustering words as that
of estimating a joint distribution over the Cartesian product of a
partition of a set of nouns (in general, any set of words) and a partition
of a set of verbs (in general, any set of words), and propose an estimation
algorithm using simulated annealing with an energy function based on the
Minimum Description Length (MDL) Principle. The MDL Principle is a
well-motivated and theoretically sound principle for data compression and
estimation in information theory and statistics. As a method of statistical
estimation MDL is guaranteed to be near optimal. 
*Real World Computing Partnership 
We empirically evaluated the effectiveness of our method. In particular, we
compared the performance of an MDL-based simulated annealing algorithm in
hierarchical word clustering against that of one based on the Maximum
Likelihood Estimator (MLE, for short). We found that the MDL-based method
performs better than the MLE-based method. We also evaluated our method by
conducting pp-attachment disambiguation experiments using a thesaurus
automatically constructed by it and found that disambiguation results can
be improved. 
Since some words never occur in a corpus, and thus cannot be reliably
classified by a method solely based on corpus data, we propose to combine
the use of an automatically constructed thesaurus and a hand made thesaurus
in disambiguation. We conducted some experiments in order to test the
effectiveness of this strategy. Our experimental results indicate that
combining an automatically constructed thesaurus and a hand made thesaurus
widens the coverage¹ of our disambiguation method, while maintaining high
accuracy². 
2 The Problem Setting 
A method of constructing a thesaurus based on corpus data usually consists
of the following three steps: (i) Extract co-occurrence data (e.g. case
frame data, adjacency data) from a corpus, (ii) Starting from a single
class (or each word composing its own class), divide (or merge) word
classes based on the co-occurrence data using some similarity (distance)
measure. (The former approach is called 'divisive', the latter
'agglomerative'.) (iii) Repeat step (ii) until some stopping condition is
met, to construct a thesaurus (tree). The method we propose here consists
of the same three steps. 
¹ 'Coverage' refers to the proportion (in percentage) of test data for
which the disambiguation method can make a decision. 
² 'Accuracy' refers to the success rate, given that the disambiguation
method makes a decision. 
Suppose available to us are frequency data (co-occurrence data) between
verbs and their case slot values extracted from a corpus (step (i)). We
then view the problem of clustering words as that of estimating a
probabilistic model (representing a probability distribution) that
generates such data. 
We assume that the target model can be defined in the following way. First,
we define a noun partition $\mathcal{P}_N$ over a given set of nouns
$\mathcal{N}$ and a verb partition $\mathcal{P}_V$ over a given set of
verbs $\mathcal{V}$. A noun partition is any set $\mathcal{P}_N$ satisfying
$\mathcal{P}_N \subseteq 2^{\mathcal{N}}$,
$\bigcup_{C_i \in \mathcal{P}_N} C_i = \mathcal{N}$ and
$\forall C_i, C_j \in \mathcal{P}_N$ with $i \neq j$,
$C_i \cap C_j = \emptyset$. A verb partition $\mathcal{P}_V$ is defined
analogously. In this paper, we call a member of a noun partition 'a noun
cluster', and a member of a verb partition 'a verb cluster'. We refer to a
member of the Cartesian product of a noun partition and a verb partition
($\in \mathcal{P}_N \times \mathcal{P}_V$) simply as 'a cluster'. We then
define a probabilistic model (a joint distribution), written
$P(C_n, C_v)$, where random variable $C_n$ assumes a value from a fixed
noun partition $\mathcal{P}_N$, and $C_v$ a value from a fixed verb
partition $\mathcal{P}_V$. Within a given cluster, we assume that each
element is generated with equal probability, i.e., 

$$\forall n \in C_n,\ \forall v \in C_v:\quad P(n, v) = \frac{P(C_n, C_v)}{|C_n \times C_v|} \tag{1}$$

In this paper, we assume that the observed data are generated by a model
belonging to the class of models just described, and select a model which
best explains the data. As a result of this, we obtain both noun clusters
and verb clusters. This problem setting is based on the intuitive
assumption that similar words occur in the same context with roughly equal
likelihood, as is made explicit in equation (1). Thus selecting a model
which best explains the given data is equivalent to finding the most
appropriate classification of words based on their co-occurrence. 
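
The model class of equation (1) can be illustrated with a minimal Python sketch. The words, partitions and cluster probabilities below are hypothetical toy values chosen for illustration, and `pair_probability` is an illustrative helper name, not something defined in the paper.

```python
from itertools import product

def pair_probability(noun, verb, noun_partition, verb_partition, cluster_prob):
    """P(n, v) under equation (1): the probability of the cluster containing
    (n, v) is spread uniformly over its |C_n| * |C_v| member pairs."""
    c_n = next(c for c in noun_partition if noun in c)
    c_v = next(c for c in verb_partition if verb in c)
    return cluster_prob[(c_n, c_v)] / (len(c_n) * len(c_v))

# Hypothetical toy model: two noun clusters, two verb clusters.
noun_partition = [frozenset({"stock", "share"}), frozenset({"million"})]
verb_partition = [frozenset({"buy", "sell"}), frozenset({"rise"})]
cluster_prob = {
    (noun_partition[0], verb_partition[0]): 0.4,
    (noun_partition[0], verb_partition[1]): 0.2,
    (noun_partition[1], verb_partition[0]): 0.1,
    (noun_partition[1], verb_partition[1]): 0.3,
}

nouns = ["stock", "share", "million"]
verbs = ["buy", "sell", "rise"]
# The word-level probabilities form a proper joint distribution.
total = sum(
    pair_probability(n, v, noun_partition, verb_partition, cluster_prob)
    for n, v in product(nouns, verbs)
)
print(pair_probability("stock", "buy", noun_partition, verb_partition, cluster_prob))
print(total)
```

Words in the same cluster receive identical probabilities in every context, which is the sense in which the partition itself acts as the model.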
3 Clustering with MDL 
We now turn to the question of what strategy (or criterion) we should
employ for estimating the best model. Our choice is the MDL (Minimum
Description Length) principle (Rissanen, 1989), a well-known principle of
data compression and statistical estimation from information theory. MDL
stipulates that the best probability model for given data is that model
which requires the least code length for encoding of the model itself, as
well as the given data relative to it³. We refer to the code length for the
model as 'the model description length' and that for the data as 'the data
description length.' 
³ We refer the interested reader to (Li and Abe, 1995) for an explanation
of the rationale behind using the MDL principle in natural language
processing. 
We apply MDL to the problem of estimating a model consisting of a pair of
partitions as described above. In this context, a model with fewer clusters
tends to be simpler (in terms of the number of parameters), but also tends
to have a poorer fit to the data. In contrast, a model with more clusters
is more complex, but tends to have a better fit to the data. Thus, there is
a trade-off relationship between the simplicity of a model and the goodness
of fit to the data. The model description length quantifies the simplicity
(complexity) of a model, and the data description length quantifies the fit
to the data. According to MDL, the model which minimizes the sum total of
the two types of description lengths should be selected. 
In what follows, we will describe in detail how the description length is
to be calculated in our current context, as well as our simulated annealing
algorithm based on MDL. 
3.1 Calculating Description Length 
We will now describe how the description length for a model is calculated.
Recall that each model is specified by the Cartesian product of a partition
of nouns and a partition of verbs, and a number of parameters for them.
Here we let $k_n$ denote the size of the noun partition, and $k_v$ the size
of the verb partition. Then, there are $k_n \cdot k_v - 1$ free parameters
in a model. 
Given a model $M$ and data $S$, its total description length $L(M)$⁴ is
computed as the sum of the model description length $L_{mod}(M)$, the
description length of its parameters $L_{par}(M)$, and data description
length $L_{dat}(M)$. (We often refer to $L_{mod}(M) + L_{par}(M)$ as the
model description length). Namely, 

$$L(M) = L_{mod}(M) + L_{par}(M) + L_{dat}(M) \tag{2}$$

We employ the 'binary noun clustering method', in which $k_v$ is fixed at
$|\mathcal{V}|$ and we are to decide whether $k_n = 1$ or $k_n = 2$, which
is then to be applied recursively to the clusters thus obtained. This is as
if we view the nouns as entities and the verbs as features and cluster the
entities based on their features. Since there are $2^{|\mathcal{N}|}$
subsets of the set of nouns $\mathcal{N}$, and for each 'binary' noun
partition we have two different subsets (a special case of which is when
one subset is $\mathcal{N}$ and the other the empty set $\emptyset$), the
number of possible binary noun partitions is
$2^{|\mathcal{N}|}/2 = 2^{|\mathcal{N}|-1}$. Thus for each binary noun
partition we need $\log 2^{|\mathcal{N}|-1} = |\mathcal{N}| - 1$ bits⁵ to
describe it.⁶ Hence $L_{mod}(M)$ is calculated as⁷ 

$$L_{mod}(M) = |\mathcal{N}| - 1 \tag{3}$$

⁴ $L(M)$ depends on $S$, but we will leave $S$ implicit. 
⁵ Throughout the paper 'log' denotes the logarithm to the base 2. 
⁶ For further explanation, see (Quinlan and Rivest, 1989). 
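
The counting argument behind equation (3) can be checked by brute force on a small noun set. The sketch below is only a sanity check under the assumption that a binary partition is an unordered pair of subsets, one of which may be empty; the function name and the four example nouns are illustrative.

```python
import math
from itertools import combinations

def binary_partitions(nouns):
    """Enumerate unordered splits {A, B} of `nouns`, where B may be empty."""
    nouns = sorted(nouns)
    first, rest = nouns[0], nouns[1:]
    # Always placing `first` in A avoids counting {A, B} and {B, A} twice.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            a = frozenset((first,) + extra)
            yield a, frozenset(nouns) - a

nouns = {"stock", "share", "million", "billion"}
count = sum(1 for _ in binary_partitions(nouns))
bits = math.log2(count)  # code length needed to name one binary partition
print(count, bits)
```

For four nouns the enumeration yields $2^{4-1} = 8$ partitions, so 3 bits suffice, in agreement with $|\mathcal{N}| - 1$.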
$L_{par}(M)$, often referred to as the parameter description length, is
calculated by, 

$$L_{par}(M) = \frac{k_n \cdot k_v - 1}{2} \cdot \log |S| \tag{4}$$

where $|S|$ denotes the input data size, and $k_n \cdot k_v - 1$ is the
number of (free) parameters in the model. It is known that using
$\log \sqrt{|S|} = \frac{\log |S|}{2}$ bits to describe each of the
parameters will (approximately) minimize the description length (Rissanen,
1989). 
Finally, $L_{dat}(M)$ is calculated by 

$$L_{dat}(M) = - \sum_{(n,v) \in S} f(n, v) \cdot \log \hat{P}(n, v) \tag{5}$$

where $f(n, v)$ denotes the observed frequency of the noun verb pair
$(n, v)$, and $\hat{P}(n, v)$ the estimated probability of $(n, v)$, which
is calculated as follows. 

$$\forall n \in C_n,\ \forall v \in C_v:\quad \hat{P}(n, v) = \frac{\hat{P}(C_n, C_v)}{|C_n \times C_v|} \tag{6}$$

$$\hat{P}(C_n, C_v) = \frac{f(C_n, C_v)}{|S|} \tag{7}$$

where $f(C_n, C_v)$ denotes the observed frequency of the noun verb pairs
belonging to cluster $(C_n, C_v)$. 
With the description length of a model defined in the above manner, we wish
to select a model having the minimum description length and output it as
the result of clustering. Since the model description length $L_{mod}$ is
the same for each model, in practice we only need to calculate and compare
$L'(M) = L_{par}(M) + L_{dat}(M)$. 
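
The quantities in equations (4) to (7) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that every verb forms its own cluster ($k_v = |\mathcal{V}|$), that the sum in equation (5) runs over distinct observed pairs weighted by their frequency, and it uses a hypothetical toy frequency table.

```python
import math
from collections import Counter

def description_length(noun_partition, verbs, freq):
    """Return (L_par, L_dat) in bits for a noun partition, with each verb
    in its own cluster, following equations (4) to (7)."""
    size = sum(freq.values())                       # |S|
    clusters = [c for c in noun_partition if c]     # ignore an empty subset
    k_n, k_v = len(clusters), len(verbs)
    l_par = (k_n * k_v - 1) / 2 * math.log2(size)   # equation (4)

    cluster_of = {n: c for c in clusters for n in c}
    cell_freq = Counter()                           # f(C_n, C_v)
    for (n, v), f in freq.items():
        cell_freq[(cluster_of[n], v)] += f

    l_dat = 0.0
    for (n, v), f in freq.items():
        c_n = cluster_of[n]
        p_cluster = cell_freq[(c_n, v)] / size      # equation (7)
        p_pair = p_cluster / len(c_n)               # equation (6), |C_v| = 1
        l_dat -= f * math.log2(p_pair)              # equation (5)
    return l_par, l_dat

# Hypothetical co-occurrence counts: f(noun, verb).
freq = Counter({("stock", "buy"): 4, ("share", "buy"): 4,
                ("wine", "drink"): 4, ("beer", "drink"): 4})
verbs = {"buy", "drink"}
one = [frozenset({"stock", "share", "wine", "beer"})]
two = [frozenset({"stock", "share"}), frozenset({"wine", "beer"})]
four = [frozenset({n}) for n in ("stock", "share", "wine", "beer")]

for name, partition in [("one", one), ("two", two), ("four", four)]:
    l_par, l_dat = description_length(partition, verbs, freq)
    print(name, l_par, l_dat, l_par + l_dat)
```

On this toy table the sum $L_{par} + L_{dat}$ is 50, 38 and 46 bits for one, two and four noun clusters respectively, so the two-cluster model is preferred. The data description length alone is 32 bits for both the two-cluster and the four-cluster model, so it cannot separate them. ($L_{mod}$ is left out here, and the comparison across three partition sizes is only meant to show the trade-off.)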
3.2 A Simulated Annealing-based Algorithm 
We could in principle calculate the description length for each model and
select a model with the minimum description length, if computation time
were of no concern. However, since the number of probabilistic models under
consideration is super exponential, this is not feasible in practice. We
employ the 'simulated annealing technique' to deal with this problem.
Figure 1 shows our (divisive) clustering algorithm⁸. 
⁷ The exact formulation of $L_{mod}(M)$ is subjective, and it depends on
the exact coding scheme used for the description of the models. 
⁸ As we noted earlier, an alternative would be to employ an agglomerative
algorithm. 
4 Advantages of Our Method 
In this section, we elaborate on the merits of our method. 
In statistical natural language processing, usually the number of
parameters in a probabilistic 
model to be estimated is very large, and therefore such a model is
difficult to estimate with a reasonable data size that is available in
practice. (This problem is usually referred to as the 'data sparseness
problem'.) We could smooth the estimated probabilities using an existing
smoothing technique (e.g., (Dagan et al., 1992; Gale and Church, 1990)),
then calculate some similarity measure using the smoothed probabilities,
and then cluster words according to it. There is no guarantee, however,
that the employed smoothing method is in any way consistent with the
clustering method used subsequently. Our method based on MDL resolves this
issue in a unified fashion. By employing models that embody the assumption
that words belonging to the same class occur in the same context with equal
likelihood, our method achieves the smoothing effect as a side effect of
the clustering process, where the domains of smoothing coincide with the
classes obtained by clustering. Thus, the coarseness or fineness of
clustering also determines the degree of smoothing. All of these effects
fall out naturally as a corollary of the imperative of 'best possible
estimation', the original motivation behind the MDL principle. 
In our simulated annealing algorithm, we could alternatively employ the
Maximum Likelihood Estimator (MLE) as criterion for the best probabilistic
model, instead of MDL. MLE, as its name suggests, selects a model which
maximizes the likelihood of the data, that is,
$\hat{P} = \arg\max_P \prod_{x \in S} P(x)$. This is equivalent to
minimizing the 'data description length' as defined in Section 3, i.e.
$\hat{P} = \arg\min_P \sum_{x \in S} -\log P(x)$. We can see easily that
MDL generalizes MLE, in that it also takes into account the complexity of
the model itself. In the presence of models with varying complexity, MLE
tends to overfit the data, and output a model that is too complex and
tailored to fit the specifics of the input data. If we employ MLE as
criterion in our simulated annealing algorithm, it will result in selecting
a very fine model with many small clusters, most of which will have
probabilities estimated as zero. Thus, in contrast to employing MDL, it
will not have the effect of smoothing at all. 
Purely as a method of estimation as well, the superiority of MDL over MLE
is supported by convincing theoretical findings (cf. (Barron and Cover,
1991; Yamanishi, 1992)). For instance, the speed of convergence of the
models selected by MDL to the true model is known to be near optimal. (The
models selected by MDL converge to the true model approximately at the rate
of 1/s where s is the number of parameters in the true model, whereas for
MLE the rate is 1/t, where t is the size of the domain, or in our context,
the total number of elements of $\mathcal{N} \times \mathcal{V}$.)
'Consistency' is another desirable property of MDL, which is not shared by
MLE. That is, the number of parameters in the models selected by MDL
converges to that of the true model (Rissanen, 1989). Both of these
properties of MDL are empirically verified in our present context, as will
be shown in the next section. In particular, we have compared the
performance of employing an MDL-based simulated annealing against that of
one based on MLE, in hierarchical word clustering. 
Algorithm: Clustering 
1. Divide the noun set N into two subsets. Define a probabilistic model consisting of the partition 
of nouns specified by the two subsets and the entire set of verbs. 
2. do { 
2.1 Randomly select one noun, remove it from the subset it belongs to and add it to the other. 
2.2 Calculate the description length for the two models (before and after the move) as L1 and 
L2, respectively. 
2.3 Viewing the description length as the energy function for annealing, let ΔL = L2 - L1. 
If ΔL < 0, fix the move, otherwise accept the move with probability P = exp(-ΔL/T). 
} while (the description length has decreased during the past 10·|N| trials.) 
Here T is the annealing temperature whose initial value is 1 and updated to be 0.97T after 
10·|N| trials. 
3. If one of the obtained subsets is empty, then return the non-empty subset, otherwise recursively 
apply Clustering on both of the two subsets. 
Figure 1: Simulated annealing algorithm for word clustering 
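
The procedure of Figure 1 can be sketched in Python as below. This is a rough illustration under several assumptions that the paper does not state: the initial division is random, "has decreased" is read as "a new minimum was found in the last window", the best division seen is returned rather than the final state, and the recursion uses only the data for the nouns in the current subset. The function names and the toy counts are hypothetical.

```python
import math
import random
from collections import Counter

def l_prime(a, b, n_verbs, freq):
    """L'(M) = L_par(M) + L_dat(M) for the noun split (a, b), one cluster per verb."""
    size = sum(freq.values())                                   # |S|
    clusters = [s for s in (a, b) if s]                         # drop an empty subset
    l_par = (len(clusters) * n_verbs - 1) / 2 * math.log2(size)
    side = {n: i for i, s in enumerate(clusters) for n in s}
    cell = Counter()                                            # f(C_n, C_v)
    for (n, v), f in freq.items():
        cell[(side[n], v)] += f
    l_dat = -sum(f * math.log2(cell[(side[n], v)] / size / len(clusters[side[n]]))
                 for (n, v), f in freq.items())
    return l_par + l_dat

def split(nouns, n_verbs, freq, rng, cooling=0.97):
    """Steps 1 and 2 of Figure 1 for a single binary division."""
    nouns = sorted(nouns)
    a = {n for n in nouns if rng.random() < 0.5}                # step 1 (random start: assumption)
    b = set(nouns) - a
    current = l_prime(a, b, n_verbs, freq)
    best, best_split = current, (set(a), set(b))
    temperature, window = 1.0, 10 * len(nouns)
    improved = True
    while improved:                                             # stop after a window without a new minimum
        improved = False
        for _ in range(window):
            noun = rng.choice(nouns)                            # step 2.1
            src, dst = (a, b) if noun in a else (b, a)
            src.remove(noun)
            dst.add(noun)
            candidate = l_prime(a, b, n_verbs, freq)            # step 2.2
            delta = candidate - current                         # step 2.3
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                current = candidate
                if current < best - 1e-9:
                    best, best_split, improved = current, (set(a), set(b)), True
            else:                                               # rejected: undo the move
                dst.remove(noun)
                src.add(noun)
        temperature *= cooling                                  # T <- 0.97 T after each window
    return best_split

def cluster(nouns, n_verbs, freq, rng):
    """Step 3: recurse until a division leaves one subset empty."""
    if len(nouns) < 2:
        return sorted(nouns)
    sub = {k: f for k, f in freq.items() if k[0] in nouns}      # assumption: restrict the data
    a, b = split(nouns, n_verbs, sub, rng)
    if not a or not b:
        return sorted(a | b)
    return [cluster(a, n_verbs, sub, rng), cluster(b, n_verbs, sub, rng)]

freq = {("stock", "buy"): 4, ("share", "buy"): 4,
        ("wine", "drink"): 4, ("beer", "drink"): 4}
tree = cluster({"stock", "share", "wine", "beer"}, 2, freq, random.Random(0))
print(tree)
```

Replacing `l_prime` with a function that returns only the data description length gives the MLE variant of the same search.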
5 Experimental Results 
We describe our experimental results in this section. 
5.1 Experiment 1: MDL vs. MLE 
We compared the performance of employing MDL as a criterion in our
simulated annealing algorithm, against that of employing MLE by simulation
experiments. We artificially constructed a true model of word
co-occurrence, and then generated data according to its distribution. We
then used the data to estimate a model (clustering words), and measured the
KL distance⁹ between the true model and the estimated model. (The algorithm
used for MLE was the same as that shown in Figure 1, except the 'data
description length' replaces the '(total) description length' in Step 2.)
Figure 3(a) plots the number of obtained noun clusters (leaf nodes in the
obtained thesaurus tree) versus the input data size, averaged over 10
trials. (The number of noun clusters in the true model is 4.) Figure 3(b)
plots the KL distance versus the data size, also averaged over the same 10
trials. The results indicate that MDL converges to the true model faster
than MLE. Also, MLE tends to select a model overfitting the data, while MDL
tends to select a model which is simple and yet fits the data reasonably
well. 
⁹ The KL distance (relative entropy), which is widely used in information
theory and statistics, is a measure of 'distance' between two distributions
(Cover and Thomas, 1991). It is always non-negative and is zero iff the two
distributions are identical. 
5.2 Experiment 2: Qualitative Evaluation 
We extracted roughly 180,000 case frames from the bracketed WSJ (Wall
Street Journal) corpus of the Penn Tree Bank (Marcus et al., 1993) as
co-occurrence data. We then constructed a number of thesauri based on these
data, using our method. Figure 2 shows an example thesaurus for the 20 most
frequently occurring nouns in the data, constructed based on their
appearances as subject and object of roughly 2000 verbs. The obtained
thesaurus seems to agree with human intuition to some degree. For example,
'million' and 'billion' are classified in one noun cluster, and 'stock' and
'share' are classified together. Not all of the noun clusters, however,
seem to be meaningful in the useful sense. This is probably because the
data size we had was not large enough. Pragmatically speaking, however,
whether the obtained thesaurus agrees with our intuition in itself is only
of secondary concern, since the main purpose is to use the constructed
thesaurus to help improve on a disambiguation task. 
Figure 2: An example thesaurus [tree diagram not recoverable from the source; legible leaves include 'sale', 'stock', 'share', 'billion' and 'million'] 
Figure 3: (a) Number of clusters versus data size and (b) KL distance versus data size [plots not recoverable from the source; each panel shows one curve for "MDL" and one for "MLE", with data size on a logarithmic axis up to 100000] 
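
The KL distance used as the evaluation measure in Experiment 1 can be computed as in the following sketch. It assumes discrete distributions stored as dictionaries and measures the distance from the true distribution `p` to the estimate `q`; the paper does not spell out the direction, so that choice and the toy distributions are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; p and q map outcomes to probabilities."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                      # 0 * log(0 / q) is taken to be 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf               # q rules out something p can generate
        total += px * math.log2(px / qx)
    return total

true_model = {("stock", "buy"): 0.5, ("share", "buy"): 0.5}
smoothed = {("stock", "buy"): 0.25, ("share", "buy"): 0.75}
zero_estimate = {("stock", "buy"): 1.0, ("share", "buy"): 0.0}

print(kl_divergence(true_model, true_model))
print(kl_divergence(true_model, smoothed))
print(kl_divergence(true_model, zero_estimate))
```

The last call returns infinity, which shows how an estimate that assigns probability zero to an event the true model can generate is penalized under this measure.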
5.3 Experiment 3: Disambiguation 
We also evaluated our method by using a constructed thesaurus in a
pp-attachment disambiguation experiment. 
We used as training data the same 180,000 case frames as in Experiment 2.
We also extracted as our test data 172 (verb, noun1, prep, noun2) patterns
from the data in the same corpus, which is not used in the training data.
For the 150 words that appear in the position of noun2 in the test data, we
constructed a thesaurus based on the co-occurrences between heads and slot
values of the frames in the training data. This is because in our
disambiguation test we only need a thesaurus consisting of these 150 words.
We then applied the learning method proposed in (Li and Abe, 1995) to learn
case frame patterns with the constructed thesaurus as input using the same
training data. That is, we used it to learn the conditional distributions
$P(Class_1 \mid verb, prep)$, $P(Class_2 \mid noun_1, prep)$, where
$Class_1$ and $Class_2$ vary over the internal nodes in a certain 'cut' in
the thesaurus tree¹⁰. We then compare $\hat{P}(noun_2 \mid verb, prep)$ and
$\hat{P}(noun_2 \mid noun_1, prep)$, which are estimated based on the case
frame patterns, to determine the attachment site of (prep, noun2). More
specifically, if the former is larger than the latter, we attach it to
verb, and if the latter is larger than the former, we attach it to noun1,
and otherwise (including when both are 0), we conclude that we cannot make
a decision. 
¹⁰ Each 'cut' in a thesaurus tree defines a different noun partition. See
(Li and Abe, 1995) for details. 
Table 1: PP-attachment disambiguation results 
                 Coverage(%)   Accuracy(%) 
Base Line        100           70.2 
Word-Based       19.7          95.1 
MDL-Thesaurus    33.1          93.0 
MLE-Thesaurus    33.7          89.7 
WordNet          49.4          88.2 
Table 1 shows the results of our pp-attachment disambiguation experiment in
terms of 'coverage' and 'accuracy.' Here 'coverage' refers to the
proportion (in percentage) of the test patterns on which the disambiguation
method could make a decision. 'Base Line' refers to the method of always
attaching (prep, noun2) to noun1. 'Word-Based', 'MLE-Thesaurus', and
'MDL-Thesaurus' respectively stand for using word-based estimates, using a
thesaurus constructed by employing MLE, and using a thesaurus constructed
by our method. Note that the coverage of 'MDL-Thesaurus' significantly
outperformed that of 'Word-Based', while basically maintaining high
accuracy (though it drops somewhat), indicating that using an automatically
constructed thesaurus can improve disambiguation results in terms of
coverage. 
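
The attachment decision and the two evaluation measures can be sketched as follows. The probability values and gold labels are made up for illustration, and `attach` and `coverage_and_accuracy` are hypothetical helper names; estimating the probabilities themselves (Li and Abe, 1995) is outside this sketch.

```python
def attach(p_verb, p_noun1):
    """Compare the two estimated probabilities of noun2; return 'verb',
    'noun1', or None when no decision can be made (ties, including 0 vs 0)."""
    if p_verb > p_noun1:
        return "verb"
    if p_noun1 > p_verb:
        return "noun1"
    return None

def coverage_and_accuracy(decisions, gold):
    """Coverage: share of patterns with a decision. Accuracy: share of
    correct decisions among the decided patterns. Both in percent."""
    decided = [(d, g) for d, g in zip(decisions, gold) if d is not None]
    coverage = 100 * len(decided) / len(gold)
    accuracy = 100 * sum(d == g for d, g in decided) / len(decided) if decided else 0.0
    return coverage, accuracy

# Hypothetical estimates (p_verb, p_noun1) for four test patterns.
estimates = [(0.3, 0.1), (0.0, 0.0), (0.1, 0.2), (0.2, 0.4)]
gold = ["verb", "noun1", "verb", "noun1"]
decisions = [attach(pv, pn) for pv, pn in estimates]
coverage, accuracy = coverage_and_accuracy(decisions, gold)
print(decisions, coverage, accuracy)
```

Because accuracy is computed only over decided patterns, a method can trade coverage for accuracy by abstaining more often, which is the pattern visible between 'Word-Based' and 'MDL-Thesaurus' in Table 1.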
We also tested the method proposed in (Li and Abe, 1995) of learning case
frame patterns using an existing thesaurus. In particular, we used this
method with WordNet (Miller et al., 1993) and using the same training data,
and then conducted pp-attachment disambiguation experiment using the
obtained case frame patterns. We show the result of this experiment as
'WordNet' in Table 1. We can see that in terms of 'coverage', 'WordNet'
outperforms 'MDL-Thesaurus', but in terms of 'accuracy', 'MDL-Thesaurus'
outperforms 'WordNet'. These results can be interpreted as follows. An
automatically constructed thesaurus is more domain dependent and captures
the domain dependent features better, and thus using it achieves high
accuracy. On the other hand, since training data we had available is
insufficient, its coverage is smaller than that of a hand made thesaurus.
In practice, it makes sense to combine both types of thesauri. More
specifically, an automatically constructed thesaurus can be used within its
coverage, and outside its coverage, a hand made thesaurus can be used.
Given the current state of the word clustering technique (namely, it
requires data size that is usually not available, and it tends to be
computationally demanding), this strategy is practical. We show the result
of this combined method as 'MDL-Thesaurus + WordNet' in Table 2. Our
experimental result shows that employing the combined method does increase
the coverage of disambiguation. We also tested 'MDL-Thesaurus + WordNet +
LA + Default', which stands for using the learned thesaurus and WordNet
first, then the lexical association value proposed by (Hindle and Rooth,
1991), and finally the default (i.e. always attaching (prep, noun2) to
noun1). Our best disambiguation result obtained using this last combined
method somewhat improves the accuracy reported in (Li and Abe, 1995)
(84.3%). 
Table 2: PP-attachment disambiguation results 
                                         Coverage(%)   Accuracy(%) 
MDL-Thesaurus + WordNet                  54.1          87.1 
MDL-Thesaurus + WordNet + LA + Default   100           85.5 
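
The combined strategy reported in Table 2 amounts to a back-off chain: each method is consulted in turn and the first one that can decide wins, with the default attachment as the last resort. The sketch below shows only this control flow. The lookup tables standing in for the individual methods are hypothetical placeholders, not the actual models.

```python
def make_decider(table):
    """Wrap a lookup table so that unseen patterns yield None (no decision)."""
    return lambda pattern: table.get(pattern)

def combined(pattern, deciders, default="noun1"):
    """Return the first available decision, else the default attachment."""
    for decide in deciders:
        choice = decide(pattern)
        if choice is not None:
            return choice
    return default

# Hypothetical stand-ins for the methods, in the order they are consulted.
mdl_thesaurus = make_decider({("eat", "salad", "with", "fork"): "verb"})
wordnet = make_decider({("buy", "share", "of", "stock"): "noun1"})
lexical_association = make_decider({("put", "book", "on", "table"): "verb"})
chain = [mdl_thesaurus, wordnet, lexical_association]

print(combined(("eat", "salad", "with", "fork"), chain))
print(combined(("put", "book", "on", "table"), chain))
print(combined(("see", "man", "in", "park"), chain))
```

Since the default always answers, the full chain has 100% coverage by construction, and its accuracy then depends on how often the later, less precise methods are reached.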
6 Concluding Remarks 
We have proposed a method of hierarchical clustering of words based on
large corpus data. We conclude with the following remarks. 
1. Our method of clustering words based on the MDL principle is
theoretically sound. Our experimental results show that it is better to
employ MDL than MLE as estimation criterion in hierarchical word
clustering. 
2. Using a thesaurus constructed by our method can improve pp-attachment
disambiguation results. 
3. At the current state of the art in statistical natural language
processing, it is best to use a combination of an automatically constructed
thesaurus and a hand made thesaurus for disambiguation purpose. The
disambiguation accuracy obtained this way was 85.5%. 
In the future, hopefully with larger training data size, we plan to
construct larger thesauri as well as to test other clustering algorithms. 
References 
Andrew R. Barron and Thomas M. Cover. 1991. Minimum complexity density
estimation. IEEE Transactions on Information Theory, 37(4):1034-1054. 
Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai,
and Robert L. Mercer. 1992. Class-based n-gram models of natural language.
Computational Linguistics, 18(4):283-298. 
Thomas M. Cover and Joy A. Thomas. 1991. Elements of Information Theory.
John Wiley & Sons Inc. 
Ido Dagan, Shaul Marcus, and Shaul Markovitch. 1992. Contextual word
similarity and estimation from sparse data. Proceedings of the 30th ACL,
pages 164-171. 
William A. Gale and Kenneth W. Church. 1990. Poor estimates of context are
worse than none. Proceedings of the DARPA Speech and Natural Language
Workshop, pages 283-287. 
Donald Hindle and Mats Rooth. 1991. Structural ambiguity and lexical
relations. Proceedings of the 29th ACL, pages 229-236. 
Donald Hindle. 1990. Noun classification from predicate-argument
structures. Proceedings of the 28th ACL, pages 268-275. 
Hang Li and Naoki Abe. 1995. Generalizing case frames using a thesaurus and
the MDL principle. Proceedings of Recent Advances in Natural Language
Processing, pages 239-248. 
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993.
Building a large annotated corpus of English: The Penn Treebank.
Computational Linguistics, 19(1):313-330. 
George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and
Katherine Miller. 1993. Introduction to WordNet: An on-line lexical
database. Anonymous FTP: clarity.princeton.edu. 
Fernando Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional
clustering of English words. Proceedings of the 31st ACL, pages 183-190. 
J. Ross Quinlan and Ronald L. Rivest. 1989. Inferring decision trees using
the minimum description length principle. Information and Computation,
80:227-248. 
Jorma Rissanen. 1989. Stochastic Complexity in Statistical Inquiry. World
Scientific Publishing Co. 
Takenobu Tokunaga, Makoto Iwayama, and Hozumi Tanaka. 1995. Automatic
thesaurus construction based-on grammatical relations. Proceedings of
IJCAI'95. 
Kenji Yamanishi. 1992. A learning criterion for stochastic rules. Machine
Learning, 9:165-203. 
