Word Clustering and Disambiguation Based on Co-occurrence 
Data 
Hang Li and Naoki Abe 
Theory NEC Laboratory, Real World Computing Partnership 
c/o C&C Media Research Laboratories, NEC 
4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216-8555, Japan 
{lihang,abe}@ccm.cl.nec.co.jp 
Abstract 
We address the problem of clustering words (or con- 
structing a thesaurus) based on co-occurrence data, 
and using the acquired word classes to improve the 
accuracy of syntactic disambiguation. We view this 
problem as that of estimating a joint probability dis- 
tribution specifying the joint probabilities of word 
pairs, such as noun verb pairs. We propose an effi- 
cient algorithm based on the Minimum Description 
Length (MDL) principle for estimating such a prob- 
ability distribution. Our method is a natural ex- 
tension of those proposed in (Brown et al., 1992) 
and (Li and Abe, 1996), and overcomes their draw- 
backs while retaining their advantages. We then 
combine this clustering method with the disam- 
biguation method of (Li and Abe, 1995) to derive a 
disambiguation method that makes use of both auto- 
matically constructed thesauruses and a hand-made 
thesaurus. The overall disambiguation accuracy 
achieved by our method is 85.2%, which compares 
favorably against the accuracy (82.4%) obtained by 
the state-of-the-art disambiguation method of (Brill 
and Resnik, 1994). 
1 Introduction 
We address the problem of clustering words, or that 
of constructing a thesaurus, based on co-occurrence 
data. We view this problem as that of estimating a 
joint probability distribution over word pairs, speci- 
fying the joint probabilities of word pairs, such as 
noun verb pairs. In this paper, we assume that 
the joint distribution can be expressed in the fol- 
lowing manner, which is stated for noun verb pairs 
for the sake of readability: The joint probability of 
a noun and a verb is expressed as the product of the 
joint probability of the noun class and the verb class 
which the noun and the verb respectively belong to, 
and the conditional probabilities of the noun and the 
verb given their respective classes. 
As a method for estimating such a probability 
distribution, we propose an algorithm based on the 
Minimum Description Length (MDL) principle. Our 
clustering algorithm iteratively merges noun classes 
and verb classes in turn, in a bottom up fashion. For 
each merge it performs, it calculates the increase 
in data description length resulting from merging 
any noun (or verb) class pair, and performs the 
merge having the least increase in data description 
length, provided that the increase in data descrip- 
tion length is less than the reduction in model de- 
scription length. 
There have been a number of methods proposed in 
the literature to address the word clustering problem 
(e.g., (Brown et al., 1992; Pereira et al., 1993; Li and 
Abe, 1996)). The method proposed in this paper is 
a natural extension of both Li & Abe's and Brown 
et al's methods, and is an attempt to overcome their 
drawbacks while retaining their advantages. 
The method of Brown et al, which is based on the 
Maximum Likelihood Estimation (MLE), performs 
a merge which would result in the least reduction 
in (average) mutual information. Our method turns 
out to be equivalent to performing the merge with 
the least reduction in mutual information, provided 
that the reduction is below a certain threshold which 
depends on the size of the co-occurrence data and 
the number of classes in the current situation. This 
method, based on the MDL principle, takes into ac- 
count both the fit to data and the simplicity of a 
model, and thus can help cope with the over-fitting 
problem that the MLE-based method of Brown et al 
faces. 
The model employed in (Li and Abe, 1996) is 
based on the assumption that the word distribution 
within a class is a uniform distribution, i.e. every 
word in the same class is generated with an equal prob- 
ability. Employing such a model has the undesirable 
tendency of classifying into different classes those 
words that have similar co-occurrence patterns but 
have different absolute frequencies. The proposed 
method, in contrast, employs a model in which dif- 
ferent words within the same class can have different 
conditional generation probabilities, and thus can 
classify words in a way that is not affected by words' 
absolute frequencies and resolve the problem faced 
by the method of (Li and Abe, 1996). 
We evaluate our clustering method by using the 
word classes and the joint probabilities obtained by 
it in syntactic disambiguation experiments. Our 
experimental results indicate that using the word 
classes constructed by our method gives better dis- 
ambiguation results than when using Li & Abe's or 
Brown et al's methods. By combining thesauruses 
automatically constructed by our method and an 
existing hand-made thesaurus (WordNet), we were 
able to achieve the overall accuracy of 85.2% for pp- 
attachment disambiguation, which compares favor- 
ably against the accuracy (82.4%) obtained using the 
state-of-the-art method of (Brill and Resnik, 1994). 
2 Probability Model 
Suppose available to us are co-occurrence data over 
two sets of words, such as the sample of verbs and 
the head words of their direct objects given in Fig. 1. 
Our goal is to (hierarchically) cluster the two sets 
of words so that words having similar co-occurrence 
patterns are classified in the same class, and output 
a thesaurus for each set of words. 
        eat  drink  make
wine     0     3     1
beer     0     5     1
bread    4     0     2
rice     4     0     0

Figure 1: Example co-occurrence data 
We can view this problem as that of estimating 
the best probability model from among a class of 
models (probability distributions) which can give 
rise to the co-occurrence data. 
In this paper, we consider the following type of 
probability models. Assume without loss of generality that the two sets of words are a set of nouns $\mathcal{N}$ and a set of verbs $\mathcal{V}$. A partition $T_n$ of $\mathcal{N}$ is a set of noun classes satisfying $\bigcup_{C_n \in T_n} C_n = \mathcal{N}$ and $\forall C_i, C_j \in T_n,\ C_i \cap C_j = \emptyset$. A partition $T_v$ of $\mathcal{V}$ 
can be defined analogously. We then define a proba- 
bility model of noun-verb co-occurrence by defining 
the joint probability of a noun n and a verb v as the 
product of the joint probability of the noun and verb 
classes that n and v belong to, and the conditional 
probabilities of n and v given their classes, that is, 
$$P(n, v) = P(C_n, C_v) \cdot P(n|C_n) \cdot P(v|C_v), \qquad (1)$$
where $C_n$ and $C_v$ denote the (unique) classes to which $n$ and $v$ belong. In this paper, we refer to 
this model as the 'hard clustering model,' since it is 
based on a type of clustering in which each word can 
belong to only one class. Fig. 2 shows an example of 
the hard clustering model that can give rise to the 
co-occurrence data in Fig. 1. 
[Figure omitted: a diagram of a hard clustering model that can generate the data of Fig. 1, with the class-pair probabilities P(Cn, Cv) and the conditional probabilities P(n|Cn) and P(v|Cv) attached to the classes.]

Figure 2: Example hard clustering model 
3 Parameter Estimation 
A particular choice of partitions for a hard clustering 
model is referred to as a 'discrete' hard-clustering 
model, with the probability parameters left to be 
estimated. The values of these parameters can be 
estimated based on the co-occurrence data by the 
Maximum Likelihood Estimation. For a given set of 
co-occurrence data 
$$S = \{(n_1, v_1), (n_2, v_2), \ldots, (n_m, v_m)\},$$
the maximum likelihood estimates of the parameters 
are defined as the values that maximize the following 
likelihood function with respect to the data: 
$$\prod_{i=1}^{m} P(n_i, v_i) = \prod_{i=1}^{m} \big( P(n_i|C_{n_i}) \cdot P(v_i|C_{v_i}) \cdot P(C_{n_i}, C_{v_i}) \big).$$
It is easy to see that this is possible by setting the parameters as 
$$\hat{P}(C_n, C_v) = \frac{f(C_n, C_v)}{m}, \qquad \forall x \in \mathcal{N} \cup \mathcal{V}, \ \hat{P}(x|C_x) = \frac{f(x)}{f(C_x)}.$$
Here, m denotes the entire data size, f(Cn, Cv) the frequency of word pairs in class pair (Cn, Cv), f(x) the frequency of word x, and f(Cx) the frequency of words in class Cx. 
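As a concrete illustration, the following Python sketch computes these maximum likelihood estimates, and the joint probability of Eq. (1), for the co-occurrence sample of Fig. 1 under one hypothetical choice of partitions; the data layout and function names are ours, not part of the original implementation.

```python
from collections import Counter

# Co-occurrence counts from Fig. 1, as (noun, verb) -> frequency.
counts = {("wine", "drink"): 3, ("wine", "make"): 1,
          ("beer", "drink"): 5, ("beer", "make"): 1,
          ("bread", "eat"): 4, ("bread", "make"): 2,
          ("rice", "eat"): 4}

# One hypothetical hard clustering of the two word sets (for illustration only).
noun_classes = [frozenset({"wine", "beer"}), frozenset({"bread", "rice"})]
verb_classes = [frozenset({"drink"}), frozenset({"eat"}), frozenset({"make"})]

def cls(word, partition):
    """Return the unique class of `partition` that contains `word`."""
    return next(c for c in partition if word in c)

m = sum(counts.values())          # entire data size
f_pair = Counter()                # f(Cn, Cv): frequency of each class pair
f_word = Counter()                # f(x): frequency of each word
for (n, v), c in counts.items():
    f_pair[(cls(n, noun_classes), cls(v, verb_classes))] += c
    f_word[n] += c
    f_word[v] += c

def f_class(c):
    """f(C): total frequency of the words in class C."""
    return sum(f_word[w] for w in c)

def p_joint(n, v):
    """Eq. (1): P(n, v) = P(Cn, Cv) * P(n|Cn) * P(v|Cv), with MLE parameters."""
    cn, cv = cls(n, noun_classes), cls(v, verb_classes)
    return (f_pair[(cn, cv)] / m) * (f_word[n] / f_class(cn)) * (f_word[v] / f_class(cv))

print(p_joint("wine", "drink"))   # 8/20 * 4/10 * 8/8 = 0.16
```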
4 Model Selection Criterion 
The question now is what criterion we should employ 
to select the best model from among the possible 
models. Here we adopt the Minimum Description 
Length (MDL) principle. MDL (Rissanen, 1989) is 
a criterion for data compression and statistical esti- 
mation proposed in information theory. 
In applying MDL, we calculate the code length for 
encoding each model, referred to as the 'model de- 
scription length' L(M), the code length for encoding 
the given data through the model, referred to as the 
'data description length' $L(S|M)$, and their sum: 
$$L(M, S) = L(M) + L(S|M).$$
The MDL principle stipulates that, for both data 
compression and statistical estimation, the best 
probability model with respect to given data is that 
which requires the least total description length. 
The data description length is calculated as 
$$L(S|M) = -\sum_{(n,v) \in S} \log \hat{P}(n, v),$$
where $\hat{P}$ stands for the maximum likelihood estimate 
of P (as defined in Section 3). 
We then calculate the model description length as 
$$L(M) = \frac{k}{2} \log m,$$
where k denotes the number of free parameters in the model, and m the entire data size.¹ In this paper, 
we ignore the code length for encoding a 'discrete 
model,' implicitly assuming that it is the same for all models, and consider only the description length 
for encoding the parameters of a model as the model 
description length. 
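To make the criterion concrete, a small sketch of the total description length computation follows, reusing the hypothetical `counts`, partitions, and `p_joint` from the previous snippet; the way the free parameters are counted here is only one reasonable convention (cf. footnote 1), not necessarily the bookkeeping used in the paper's experiments.

```python
import math

def data_description_length(counts, p_joint):
    """L(S|M) = - sum over the sample of log P-hat(n, v)."""
    return -sum(c * math.log(p_joint(n, v)) for (n, v), c in counts.items())

def model_description_length(noun_classes, verb_classes, m):
    """L(M) = (k/2) * log m, with k the number of free parameters."""
    # One convention for k: the class-pair joint probabilities (minus one
    # normalization constraint) plus, for each class, its word conditional
    # probabilities (minus one constraint per class).
    k = len(noun_classes) * len(verb_classes) - 1
    k += sum(len(c) - 1 for c in noun_classes)
    k += sum(len(c) - 1 for c in verb_classes)
    return (k / 2.0) * math.log(m)

def total_description_length(counts, p_joint, noun_classes, verb_classes):
    """L(M, S) = L(M) + L(S|M); the model minimizing this sum is selected."""
    m = sum(counts.values())
    return (model_description_length(noun_classes, verb_classes, m)
            + data_description_length(counts, p_joint))
```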
If computation time were of no concern, we could 
in principle calculate the total description length for 
each model and select the optimal model in terms of 
MDL. Since the number of hard clustering models 
is of order $O(N^N \cdot V^V)$, where N and V denote the 
size of the noun set and the verb set, respectively, it 
would be infeasible to do so. We therefore need to 
devise an efficient algorithm that heuristically per- 
forms this task. 
5 Clustering Algorithm 
The proposed algorithm, which we call '2D- 
Clustering,' iteratively selects a suboptimal MDL- 
model from among those hard clustering models 
which can be obtained from the current model by 
merging a noun (or verb) class pair. As it turns out, 
the minimum description length criterion can be re- 
formalized in terms of (average) mutual information, 
and a greedy heuristic algorithm can be formulated 
to calculate, in each iteration, the reduction of mu- 
tual information which would result from merging 
any noun (or verb) class pair, and perform the merge 
1 We note that there are alternative ways of calculating 
the parameter description length. For example, we can sep- 
arately encode the different types of probability parameters: the joint probabilities P(Cn, Cv), and the conditional probabilities P(n|Cn) and P(v|Cv). Since these alternatives are 
approximations of one another asymptotically, here we use 
only the simplest formulation. In the full paper, we plan to 
compare the empirical behavior of the alternatives. 
having the least mutual information reduction, pro- 
vided that the reduction is below a variable threshold. 
2D-Clustering(S, bn, bv) 
(S is the input co-occurrence data, and bn and bv 
are positive integers.) 
1. Initialize the set of noun classes Tn and the set 
of verb classes Tv as: 
$$T_n = \{\{n\} \mid n \in \mathcal{N}\}, \quad T_v = \{\{v\} \mid v \in \mathcal{V}\},$$
where $\mathcal{N}$ and $\mathcal{V}$ denote the noun set and the verb set, respectively. 
2. Repeat the following three steps: 
(a) execute Merge(S, Tn, Tv, bn) to update Tn, 
(b) execute Merge(S, Tv, Tn, bv) to update Tv, 
(c) if Tn and Tv are unchanged, go to Step 3. 
3. Construct and output a thesaurus for nouns 
based on the history of Tn, and one for verbs 
based on the history of Tv. 
Next, we describe the procedure of 'Merge,' as it 
is being applied to the set of noun classes with the 
set of verb classes fixed. 
Merge(S, Tn, Tv, bn) 
1. For each class pair in Tn, calculate the reduc- 
tion of mutual information which would result 
from merging them. (The details will follow.) 
Discard those class pairs whose mutual informa- 
tion reduction (2) is not less than the threshold 
of 
$$\frac{(k_B - k_A) \cdot \log m}{2m},$$
where m denotes the total data size, kB the number of free parameters in the model before the merge, and kA the number of free parameters in the model after the merge. Sort the 
remaining class pairs in ascending order with 
respect to mutual information reduction. 
2. Merge the first bn class pairs in the sorted list. 
3. Output current Tn. 
We perform (a maximum of) bn merges at Step 2 to improve efficiency, which will result in outputting 
an at-most bn-ary tree. Note that, strictly speaking, 
once we perform one merge, the model will change 
and there will no longer be a guarantee that the 
remaining merges still remain justifiable from the 
viewpoint of MDL. 
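The following is a minimal Python sketch of this control structure (our own simplified re-implementation, not the authors' code): it alternates Merge passes over the noun side and the verb side until neither partition changes, and records the merge history from which the thesaurus trees are read off. The `merge` argument is expected to implement the Merge procedure above; a sketch of its MDL test appears after the derivation below.

```python
def two_d_clustering(S, b_n, b_v, merge):
    """2D-Clustering driver.  S is a list of (noun, verb) pairs; merge(S, T_rows,
    T_cols, b) performs one Merge pass on the row classes and returns them."""
    # Step 1: start from singleton classes.
    T_n = [frozenset({n}) for n in {n for n, _ in S}]
    T_v = [frozenset({v}) for v in {v for _, v in S}]
    history_n, history_v = [list(T_n)], [list(T_v)]
    # Step 2: alternately merge noun classes and verb classes.
    while True:
        new_T_n = merge(S, T_n, T_v, b_n)
        new_T_v = merge([(v, n) for n, v in S], T_v, new_T_n, b_v)
        if new_T_n == T_n and new_T_v == T_v:
            break                        # neither partition changed
        T_n, T_v = new_T_n, new_T_v
        history_n.append(list(T_n))
        history_v.append(list(T_v))
    # Step 3: the successive partitions define the noun and verb thesaurus trees.
    return history_n, history_v
```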
Next, we explain why the criterion in terms of 
description length can be reformalized in terms of 
mutual information. We denote the model before 
a merge as MB and the model after the merge as 
MA. According to MDL, MA should have the least 
increase in data description length 
$$\delta L_{dat} = L(S|M_A) - L(S|M_B) \geq 0,$$
and at the same time satisfies 
$$\delta L_{dat} < \frac{(k_B - k_A) \log m}{2}.$$
This is due to the fact that the decrease in model 
description length equals 
$$L(M_B) - L(M_A) = \frac{(k_B - k_A) \log m}{2} > 0,$$
and is identical for each merge. 
In addition, suppose that MA is obtained by merging two noun classes Ci and Cj in MB into a single noun class Cij. We in fact need only calculate the 
difference between description lengths with respect 
to these classes, i.e., 
$$\delta L_{dat} = -\sum_{C_v \in T_v} \sum_{n \in C_{ij},\, v \in C_v} \log \hat{P}(n, v) + \sum_{C_v \in T_v} \sum_{n \in C_i,\, v \in C_v} \log \hat{P}(n, v) + \sum_{C_v \in T_v} \sum_{n \in C_j,\, v \in C_v} \log \hat{P}(n, v).$$
Now using the identity 
$$P(n, v) = \frac{P(n) \cdot P(v)}{P(C_n) \cdot P(C_v)} \cdot P(C_n, C_v) = \frac{P(C_n, C_v)}{P(C_n) \cdot P(C_v)} \cdot P(n) \cdot P(v),$$
we can rewrite the above as 
$$\delta L_{dat} = -\sum_{C_v \in T_v} f(C_{ij}, C_v) \log \frac{P(C_{ij}, C_v)}{P(C_{ij}) \cdot P(C_v)} + \sum_{C_v \in T_v} f(C_i, C_v) \log \frac{P(C_i, C_v)}{P(C_i) \cdot P(C_v)} + \sum_{C_v \in T_v} f(C_j, C_v) \log \frac{P(C_j, C_v)}{P(C_j) \cdot P(C_v)}.$$
Thus, the quantity $\delta L_{dat}$ is equivalent to the mutual information reduction times the data size.² We conclude therefore that in our present context, a cluster- 
ing with the least data description length increase is 
equivalent to that with the least mutual information 
decrease. 
Canceling out P(Cv) and replacing the probabil- 
ities with their maximum likelihood estimates, we 
obtain 
$$\frac{1}{m} \delta L_{dat} = \frac{1}{m} \Big( -\sum_{C_v \in T_v} \big(f(C_i, C_v) + f(C_j, C_v)\big) \log \frac{f(C_i, C_v) + f(C_j, C_v)}{f(C_i) + f(C_j)} + \sum_{C_v \in T_v} f(C_i, C_v) \log \frac{f(C_i, C_v)}{f(C_i)} + \sum_{C_v \in T_v} f(C_j, C_v) \log \frac{f(C_j, C_v)}{f(C_j)} \Big). \qquad (2)$$
2 Average mutual information between Tn and Tv is defined as 
$$I(T_n, T_v) = \sum_{C_n \in T_n} \sum_{C_v \in T_v} \Big( P(C_n, C_v) \log \frac{P(C_n, C_v)}{P(C_n) \cdot P(C_v)} \Big).$$
Therefore, we need only calculate this quantity for 
each possible merge at Step 1 of Merge. 
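In code, the bracketed quantity of (2) (i.e., δL_dat itself) and the MDL test of Step 1 of Merge can be computed directly from class frequencies, roughly as in the following sketch (our own illustration; `f_pair` maps a (noun class, verb class) pair to its frequency and `f_class` maps a noun class to its total frequency):

```python
import math

def delta_L_dat(Ci, Cj, T_v, f_pair, f_class):
    """Increase in data description length (m times the mutual information
    reduction of Eq. (2)) caused by merging noun classes Ci and Cj,
    with the verb partition T_v held fixed."""
    fi, fj = f_class[Ci], f_class[Cj]
    total = 0.0
    for Cv in T_v:
        fiv = f_pair.get((Ci, Cv), 0)
        fjv = f_pair.get((Cj, Cv), 0)
        if fiv + fjv > 0:
            total -= (fiv + fjv) * math.log((fiv + fjv) / (fi + fj))
        if fiv > 0:
            total += fiv * math.log(fiv / fi)
        if fjv > 0:
            total += fjv * math.log(fjv / fj)
    return total

def merge_is_justified(d_L_dat, m, k_before, k_after):
    """MDL test of Step 1 of Merge: keep a candidate merge only if the increase
    in data description length is less than the saving (kB - kA) * log(m) / 2
    in model description length."""
    return d_L_dat < (k_before - k_after) * math.log(m) / 2.0
```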
In our implementation of the algorithm, we first 
load the co-occurrence data into a matrix, with 
nouns corresponding to rows, verbs to columns. 
When merging a noun class in row i and that in 
row j (i < j), for each Cv we add f(Ci, Cv) and f(Cj, Cv), obtaining f(Cij, Cv), write f(Cij, Cv) on row i, move f(Clast, Cv) to row j, and reduce the matrix by one row. 
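A sketch of this row-merging step on a NumPy count matrix follows (our own illustration of the bookkeeping described above; the verb side is handled symmetrically on columns):

```python
import numpy as np

def merge_noun_rows(F, i, j):
    """Merge the noun class in row i with the one in row j (i < j) of the
    count matrix F: row i receives the summed counts, the last row is moved
    into the freed slot, and the matrix shrinks by one row."""
    F[i, :] += F[j, :]       # f(Cij, Cv) = f(Ci, Cv) + f(Cj, Cv), written on row i
    F[j, :] = F[-1, :]       # move f(Clast, Cv) to row j
    return F[:-1, :]         # reduce the matrix by one row
```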
By the above implementation, the worst case time 
complexity of the algorithm is $O(N^3 \cdot V + V^3 \cdot N)$, 
where N denotes the size of the noun set, V that of 
the verb set. If we can merge bn and bv classes at each step, the algorithm will become slightly more efficient, with the time complexity of $O(\frac{N^3}{b_n} \cdot V + \frac{V^3}{b_v} \cdot N)$. 
6 Related Work 
6.1 Models 
We can restrict the hard clustering model (1) by as- 
suming that words within the same class are generated 
with an equal probability, obtaining 
$$P(n, v) = P(C_n, C_v) \cdot \frac{1}{|C_n|} \cdot \frac{1}{|C_v|},$$
which is equivalent to the model proposed by (Li and 
Abe, 1996). Employing this restricted model has the 
undesirable tendency to classify into different classes 
those words that have similar co-occurrence patterns 
but have different absolute frequencies. 
The hard clustering model defined in (1) can also 
be considered to be an extension of the model pro- 
posed by Brown et al. First, dividing (1) by P(v), 
we obtain 
$$P(n|v) = P(C_n|C_v) \cdot P(n|C_n) \cdot \frac{P(C_v) \cdot P(v|C_v)}{P(v)}. \qquad (3)$$
Since hard clustering implies that $\frac{P(C_v) \cdot P(v|C_v)}{P(v)} = 1$ holds, we have 
$$P(n|v) = P(C_n|C_v) \cdot P(n|C_n).$$
In this way, the hard clustering model turns out to be 
a class-based bigram model and is similar to Brown 
et al's model. The difference is that the model of (3) 
assumes that the clustering for Cn and the clustering for Cv can be different, while the model of Brown et 
al assumes that they are the same. 
A very general model of noun verb joint probabil- 
ities is a model of the following form: 
$$P(n, v) = \sum_{C_n \in \Gamma_n} \sum_{C_v \in \Gamma_v} P(C_n, C_v) \cdot P(n|C_n) \cdot P(v|C_v). \qquad (4)$$
Here $\Gamma_n$ denotes a set of noun classes satisfying $\bigcup_{C_n \in \Gamma_n} C_n = \mathcal{N}$, but not necessarily disjoint. Similarly, $\Gamma_v$ is a set of not necessarily disjoint verb 
classes. We can view the problem of clustering words 
in general as estimation of such a model. This type 
of clustering in which a word can belong to several 
different classes is generally referred to as 'soft clus- 
tering.' If we assume in the above model that each 
verb forms a verb class by itself, then (4) becomes 
$$P(n, v) = \sum_{C_n \in \Gamma_n} P(C_n, v) \cdot P(n|C_n),$$
which is equivalent to the model of Pereira et al. On 
the other hand, if we restrict the general model of (4) 
so that both noun classes and verb classes are dis- 
joint, then we obtain the hard clustering model we 
propose here (1). All of these models, therefore, are 
special cases of (4). Each specialization comes with its own merits and demerits. For example, employing 
a model of soft clustering will make the clustering 
process more flexible but also make the learning pro- 
cess more computationally demanding. Our choice 
of hard clustering obviously has the merits and de- 
merits of the soft clustering model reversed. 
6.2 Estimation criteria 
Our method is also an extension of that proposed 
by Brown et al from the viewpoint of estimation cri- 
terion. Their method merges word classes so that 
the reduction in mutual information, or equivalently 
the increase in data description length, is minimized. 
Their method has the tendency to overfit the train- 
ing data, since it is based on MLE. Employing MDL 
can help solve this problem. 
7 Disambiguation Method 
We apply the acquired word classes, or more specif- 
ically the probability model of co-occurrence, to the 
problem of structural disambiguation. In particular, 
we consider the problem of resolving pp-attachment 
ambiguities in quadruples, like (see, girl, with, tele- 
scope) and that of resolving ambiguities in com- 
pound noun triples, like (data, base, system). In 
the former, we determine to which of 'see' or 'girl' 
the phrase 'with telescope' should be attached. In 
the latter, we judge to which of 'base' or 'system' 
the word 'data' should be attached. 
We can perform pp-attachment disambiguation by 
comparing the probabilities 
$$\hat{P}_{with}(\textrm{telescope} \mid \textrm{see}), \quad \hat{P}_{with}(\textrm{telescope} \mid \textrm{girl}). \qquad (5)$$
If the former is larger, we attach 'with telescope' 
to 'see;' if the latter is larger we attach it to 'girl;' 
otherwise we make no decision. (Disambiguation on 
compound noun triples can be performed similarly.) 
Since the number of probabilities to be estimated 
is extremely large, estimating all of these probabil- 
ities accurately is generally infeasible (i.e., the data 
sparseness problem). Using our clustering model to 
calculate these conditional probabilities (by normal- 
izing the joint probabilities with marginal probabil- 
ities) can solve this problem. 
We further enhance our disambiguation method 
by the following back-off procedure: We first esti- 
mate the two probabilities in question using hard 
clustering models constructed by our method. We 
also estimate the probabilities using an existing 
(hand-made) thesaurus with the 'tree cut' estima- 
tion method of (Li and Abe, 1995), and use these 
probability values when the probabilities estimated 
based on hard clustering models are both zero. Fi- 
nally, if both of them are still zero, we make a default 
decision. 
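A sketch of this back-off decision for a quadruple follows (our own illustration; `p_cluster` and `p_treecut` stand for the conditional probability estimators obtained from the hard clustering models and from the hand-made thesaurus with the tree cut method of (Li and Abe, 1995), respectively, and are assumed to return 0 when they cannot estimate a probability; the default decision itself is left as a parameter):

```python
def pp_attach(verb, noun1, prep, noun2, p_cluster, p_treecut, default="noun"):
    """Resolve the attachment of 'prep noun2' in a (verb, noun1, prep, noun2)
    quadruple by comparing P-hat_prep(noun2 | verb) with P-hat_prep(noun2 | noun1),
    backing off from the clustering estimates to the tree cut estimates, and
    finally to a default decision."""
    for estimate in (p_cluster, p_treecut):
        p_verb = estimate(prep, noun2, verb)    # P-hat_prep(noun2 | verb)
        p_noun = estimate(prep, noun2, noun1)   # P-hat_prep(noun2 | noun1)
        if p_verb > p_noun:
            return "verb"                       # attach 'prep noun2' to the verb
        if p_noun > p_verb:
            return "noun"                       # attach it to noun1
        if p_verb > 0:
            break                               # a genuine tie: use the default
    return default
```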
8 Experimental Results 
8.1 Qualitative evaluation 
In this experiment, we used heuristic rules to extract 
verbs and the head words of their direct objects from 
the tagged texts of the WSJ corpus (ACL/DCI CD- 
ROM1) consisting of 126,084 sentences. 
[Figure omitted: part of the dendrogram over the 100 nouns; for example, 'stock,' 'bond,' and 'security' appear in one class.]

Figure 3: A part of a constructed thesaurus 
We then constructed a number of thesauruses 
based on these data, using our method. Fig. 3 shows 
a part of a thesaurus for 100 randomly selected 
nouns, based on their appearances as direct objects 
of 20 randomly selected verbs. The thesaurus seems 
to agree with human intuition to some degree, al- 
though it is constructed based on a relatively small 
amount of co-occurrence data. For example, 'stock,' 
'security,' and 'bond' are classified together, despite 
the fact that their absolute frequencies in the data 
vary a great deal (272, 59, and 79, respectively). 
The results demonstrate a desirable feature of our 
method, namely, it classifies words based solely on 
the similarities in co-occurrence data, and is not af- 
fected by the absolute frequencies of the words. 
8.2 Compound noun disambiguation 
We extracted compound noun doubles (e.g., 'data 
base') from the tagged texts of the WSJ corpus and 
used them as training data, and then conducted 
structural disambiguation on compound noun triples 
(e.g., 'data base system'). 
We first randomly selected 1,000 nouns from the 
corpus, and extracted compound noun doubles con- 
taining those nouns as training data and compound 
noun triples containing those nouns as test data. 
There were 8,604 training data and 299 test data. 
We hand-labeled the test data with the correct dis- 
ambiguation 'answers.' 
We performed clustering on the nouns on the 
left position and the nouns on the right position in 
the training data by using both our method ('2D- 
Clustering') and Brown et al's method ('Brown'). 
We actually implemented an extended version of 
their method, which separately conducts clustering 
for nouns on the left and those on the right (which 
should only improve the performance). 
[Figure omitted: disambiguation accuracy plotted against coverage for 'Word-based,' 'Brown,' and '2D-Clustering.']

Figure 4: Compound noun disambiguation results 
We next conducted structural disambiguation on 
the test data, using the probabilities estimated based 
on 2D-Clustering and Brown. We also tested the 
method of using the probabilities estimated based 
on word co-occurrences, denoted as 'Word-based.' 
Fig. 4 shows the results in terms of accuracy and 
coverage, where coverage refers to the percentage 
of test data for which the disambiguation method 
was able to make a decision. Since for Brown the 
number of classes finally created has to be specified in advance, we tried a number of alternatives and 
obtained results for each of them. (Note that, for 
2D-Clustering, the optimal number of classes is au- 
tomatically selected.) 
Table 1: Compound noun disambiguation results 
Method Acc.(%) 
Default 59.2 
Word-based + Default 73.9 
Brown + Default 77.3 
2D-Clustering + Default 78.3 
Tab. 1 shows the final results of all of the above 
methods combined with 'Default,' in which we at- 
tach the first noun to the neighboring noun when 
a decision cannot be made by each of the meth- 
ods. We see that 2D-Clustering+Default performs 
the best. These results demonstrate a desirable as- 
pect of 2D-Clustering, namely, its ability to automatically select the most appropriate level of clus- 
tering, resulting in neither over-generalization nor 
under-generalization. 
8.3 PP-attachment disambiguation 
We extracted triples (e.g., 'see, with, telescope') 
from the bracketed data of the WSJ corpus (Penn 
Tree Bank), and conducted PP-attachment disam- 
biguation on quadruples. We randomly generated 
ten sets of data consisting of different training and 
test data and conducted experiments through 'ten- 
fold cross validation,' i.e., all of the experimental 
results reported below were obtained by taking av- 
erage over ten trials. 
Table 2: PP-attachment disambiguation results 
Method Cov.(%) Acc.(%) 
Default 100 56.2 
Word-based 32.3 95.6 
Brown 51.3 98.3 
2D-Clustering 51.3 98.3 
Li-Abe96 37.3 94.7 
WordNet 74.3 94.5 
NounClass-2DC 42.6 97.1 
We constructed word classes using our method 
('2D-Clustering') and the method of Brown et al 
('Brown'). For both methods, following the pro- 
posal due to (Tokunaga et al., 1995), we separately 
conducted clustering with respect to each of the 10 
most frequently occurring prepositions (e.g., 'for,' 
'with,' etc). We did not cluster words for rarely 
occurring prepositions. We then performed disam- 
biguation based on 2D-Clustering and Brown. We 
also tested the method of using the probabilities es- 
timated based on word co-occurrences, denoted as 
'Word-based.' 
Next, rather than using the conditional probabili- 
ties estimated by our method, we only used the noun 
thesauruses constructed by our method, and applied 
the method of (Li and Abe, 1995) to estimate the 
best 'tree cut models' within the thesauruses³ in 
order to estimate the conditional probabilities like 
those in (5). We call the disambiguation method 
using these probability values 'NounClass-2DC.' We 
also tried the analogous method using thesauruses 
constructed by the method of (Li and Abe, 1996) 
3The method of (Li and Abe, 1995) outputs a 'tree cut 
model' in a given thesaurus with conditional probabilities at- 
tached to all the nodes in the tree cut. They use MDL to 
select the best tree cut model. 
and estimating the best tree cut models (this is ex- 
actly the disambiguation method proposed in that 
paper). Finally, we tried using a hand-made the- 
saurus, WordNet (this is the same as the disam- 
biguation method used in (Li and Abe, 1995)). We 
denote these methods as 'Li-Abe96' and 'WordNet,' 
respectively. 
Tab. 2 shows the results for all these methods in 
terms of coverage and accuracy. 
Table 3: PP-attachment disambiguation results 
Method Acc.(%) 
Word-based + Default 69.5 
Brown + Default 76.2 
2D-Clustering + Default 76.2 
Li-Abe96 + Default 71.0 
WordNet + Default 82.2 
NounClass-2DC + Default 73.8 
2D-Clustering + WordNet + Default 85.2 
Brill-Resnik 82.4 
We then enhanced each of these methods by using 
a default rule when a decision cannot be made, which 
is indicated as '+Default.' Tab. 3 shows the results 
of these experiments. 
We can make a number of observations from these 
results. (1) 2D-Clustering achieves a broader cover- 
age than NounClass-2DC. This is because in order 
to estimate the probabilities for disambiguation, the 
former exploits more information than the latter. 
(2) For Brown, we show here only its best result, 
which happens to be the same as the result for 2D- 
Clustering, but in order to obtain this result we had 
to take the trouble of conducting a number of tests to 
find the best level of clustering. For 2D-Clustering, 
this was done once and automatically. Compared 
with Li-Abe96, 2D-Clustering clearly performs bet- 
ter. We therefore conclude that our method improves upon these previous clustering methods in one way 
or another. (3) 2D-Clustering outperforms WordNet 
in terms of accuracy, but not in terms of coverage. 
This seems reasonable, since an automatically con- 
structed thesaurus is more domain dependent and 
therefore captures the domain dependent features 
better, and thus can help achieve higher accuracy. 
On the other hand, with the relatively small size of 
training data we had available, its coverage is smaller 
than that of a general-purpose hand-made thesaurus. 
The result indicates that it makes sense to combine 
automatically constructed thesauruses and a hand- 
made thesaurus, as we have proposed in Section 7. 
This method of combining both types of thesauruses, '2D-Clustering+WordNet+Default,' was 
then tested. We see that this method performs the 
best. (See Tab. 3.) Finally, for comparison, we 
tested the 'transformation-based error-driven learn- 
ing' proposed in (Brill and Resnik, 1994), which is 
a state-of-the-art method for pp-attachment disam- 
biguation. Tab. 3 shows the result for this method 
as 'Brill-Resnik.' We see that our disambigua- 
tion method also performs better than Brill-Resnik. 
(Note further that for Brill & Resnik's method, we 
need to use quadruples as training data, whereas 
ours only requires triples.) 
9 Conclusions 
We have proposed a new method of clustering words 
based on co-occurrence data. Our method employs 
a probability model which naturally represents co- 
occurrence patterns over word pairs, and makes use 
of an efficient estimation algorithm based on the 
MDL principle. Our clustering method improves 
upon the previous methods proposed by Brown et al 
and (Li and Abe, 1996), and furthermore it can be 
used to derive a disambiguation method with overall 
disambiguation accuracy of 85.2%, which improves 
the performance of a state-of-the-art disambiguation 
method. 
The proposed algorithm, 2D-Clustering, can be 
used in practice, as long as the data size is at the 
level of the current Penn Tree Bank. Yet it is still 
relatively computationally demanding, and thus an 
important future task is to further improve on its 
computational efficiency. 
Acknowledgement 
We are grateful to Dr. S. Doi of NEC C&C Media 
Res. Labs. for his encouragement. We thank Ms. Y. 
Yamaguchi of NIS for her programming efforts. 

References 
E. Brill and P. Resnik. 1994. A rule-based approach to prepositional phrase attachment disambiguation. Proc. of COLING'94, pp. 1198-1204. 
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. 
C. Lai, and R. L. Mercer. 1992. Class-based n- 
gram models of natural language. Comp. Ling., 
18(4):283-298. 
H. Li and N. Abe. 1995. Generalizing case frames 
using a thesaurus and the MDL principle. Comp. 
Ling., (to appear). 
H. Li and N. Abe. 1996. Clustering words with the 
MDL principle. Proc. of COLING'96, pp. 4-9. 
F. Pereira, N. Tishby, and L. Lee. 1993. Distri- 
butional clustering of English words. Proc. of 
ACL'93, pp. 183-190. 
J. Rissanen. 1989. Stochastic Complexity in Statisti- 
cal Inquiry. World Scientific Publishing Co., Sin- 
gapore. 
T. Tokunaga, M. Iwayama, and H. Tanaka. 1995. Automatic thesaurus construction based on grammatical relations. Proc. of IJCAI'95, pp. 1308-1313. 
