An Automatic Clustering of Articles Using Dictionary 
Definitions 
Fumiyo FUKUMOTO Yoshimi SUZUKI~ 
Dept. of Electrical Engineering a,nd Computer Science, Yalnanashi University 
4-3-11 T~keda, Kofu 4(10 ,b~pan 
{fukumotoOskye, ysuzuki~suwaj } . esi. yamanashi, ac. jp 
Abstract 
In this paper, we propose a statistical 
approach for clustering of artMes us- 
ing on-line dictionary definitions. One 
of the characteristics of our approach is 
that every sense of word in artMes is au- 
tomatically disambiguated using dictio- 
nary definitions. The other is that in or- 
der to cope with the problem of a phrasal 
lexicon, linking which links words with 
their semantically similar words in arti- 
cles is introduced in our method. The 
results of experiments demonstrate the 
effectiveness of the proposed method. 
1 Introduction 
There has been quite a lot of research con- 
cerned with automatic clustering of articles or 
autonmtic identification of selnantically similar 
articles(Walker, 1986), (Guthrie, 1994), (Yuasa, 
1995). Most of these works deal with entirely dif 
ferent articles. 
In general, the 1)rot)l(m: that the same word ca:, 
be used differently in different sul)jeet domains is 
less problematic in entirely ditferent artMes, such 
as 'weather forecasts', 'medical rel)orts' , and 'com- 
puter manuals'. Because these articles are charac- 
terised by a larger number of different words than 
that of the same words. However, in texts from 
a restricted domain such as financial artMes, e.g 
Wall Street Journal (WS,I in short) (Libernmn, 
1990), one encounters quite a large number of pop 
ysemous words. Therefore, polyseinous words of_ 
ten hamper the I)recise cla~ssification of artMes, 
each of which belongs to the restricted subject do- 
nlaiII. 
In this paper, we report an experimental study 
for clustering of articles by using on-line dic- 
tionary definitions attd show how dictionary- 
definition can use effectively to classify articles, 
each of which belongs to the restricted subject do- 
main. We first describe a method for disambiguat- 
ing word-senses in articles based on dictionary def- 
initions. Then, we present a method for classifying 
articles and finally, we rel)ort some ext)eriments in 
order to show the effect of the method. 
2 Related Work 
One of major approaches in automatic clustering 
of articles is based on statistical information of 
words in ,~rticles. Every article is characterised 
by a vector, each dimension of which is associated 
with a specific word in articles, and every coordi- 
nate of the artMe is represented by tern: weight- 
ing. Tern1 weighting methods have been widely 
studied in iufornmtion retrieval research (Salton, 
1983), (Jones, 1972) and some of then: are used 
in an automatic clustering of articles. Guthrie 
and Yuasa used word frequencies for weighting 
(Guthrie, 1994), (Yuasa, 1995), and Tokunaga 
used weighted inverse document frequency which 
is a word frequency within the document divided 
by its fl'equency throughout the entire document 
collection (Tokunaga, 1994). The results of these 
methods when al)plied to articles' cbussification 
task, seem to show its etfectiveness. However, 
these works do not seriously deal with the 1)roblem 
of polysemy. 
The alternative al)l)roach is based on dictio- 
nary's infl)rlnation as a thesaurus. One of major 
problems using thesaurus ('ategories a.s sense rep- 
rese::tation is a statistical sparseness for thesaurus 
words, since they are nmstly rather uncommon 
words (Niwa, 1995). Yuasa reported the exper- 
imental results when using word frequencies for 
weighting within large documents were better re- 
suits in clustering (lo('unmnts as those when EDR 
electronic dictionary as a thesaurus (Yuasa, 1995). 
The technique developed by Walker also used 
(lietionary's infornmtion and seems to cope with 
the discrimination of polysemy (Walker, 1986). 
He used the semantic codes of the Longmau Dic- 
tionary of Contemporary English in order to de- 
termine the subject donmin for a set of texts. For 
a given text, each word is checked against the dic- 
tionary to determine the semantic codes associ- 
ate(l with it. By accumulating the frequencies for 
these senses and then ordering the list, of cate- 
gories in terms of frequency, the subject matter of 
406 
the text can be identified. However, ~us he admits, 
a phrasal lexicon, such as Atlantic Seaboard, New 
England gives a negative influence for clustering, 
since it can not be regarded ~us units, i.e. each 
word which is the element of a 1)hrasal lexicon is 
assigned to each semantic code. 
The approach proposed in this paper focuses 
on these l)roblems, i.e. 1)olysemy and a phrasal 
lexicon. Like Guthrie and Yuasa's methods, our 
approach adopts a vector representation, i.e. ev- 
ery article is characterised by a vector. However~ 
while their ~pproaehes assign each (:oor(linate of 
a vector to each word in artMes, we use a word 
(noun) of wtfich sense is disambiguated. Our dis- 
ambiguation method of word-senses is based on 
Niwa's method whMt use(l the similarit;y 1)etween 
two sentences, i.e. a sentevee which contains a 
polysenmus noun and a sevtenee of dictionary- 
definition. In order to cope with Walker's l)rob - 
lem, for the results of disand)iguation technique, 
semantic relativeness of words are cMeulated, and 
semantically related words are grout)ed together. 
We used WSJ corpus as test artich,s in the ex- 
periments in order to see how our metho(l can 
effectively classify artMes, eacl, <)f whi<:h beh)ngs 
te the restricted subject domain, i.e. WS.I. 
3 Framework 
3.1 Word-Sense Disambiguation 
Every sense of words in artMes which should 
be (:lustered is automatically disambiguated in 
advance. Word-sense dismnl)iguation (WSD in 
short) is a serious problem for NLP, and a wlri('ty 
of al)l)roaches have been 1)roposed for solving it 
(Ih'own, 1991), (Yarowsky, 1992). 
Our disalnbiguation method is based on Niwa's 
method which used the similarity 1)etween a sen- 
tenee containing a t)olysemous noun and a sen= 
tence of dictionary-definition. Let x be a t)olyse- 
mous noun and a sentence X be 
X" • • • ~ 3:-n~ • • • ~ a'-i ~ ~1:~ ~1:1 , • " " ~ ilYn~ " " ' 
The vector representation of X is 
V(X) = ~ V(xi) 
where V(xi) is 
V(xi) = (Mu(xi,o~),...,Mu(xi,om)) 
Here, Mu(x, y) is the v',due of mutual information 
proposed by (Church, 1991). oj,...,om (We call 
them basic words) are selected the 1000th most 
frequent words in the reference Collins English 
Dictionary (Lil)erman, 1990). 
Let word x have senses sl,s2,...,sp and the 
dictionary-definition of si be 
Ysi: "",Y-n,'",Y-I,Y, Yt," "',Yn," "" 
The similarity of X and }~i is measured t)y the 
imter l)roduct of their normalised vectors and is 
detined as follows: 
v(x) • vo; ) 
= Iv(x)II vo%dl (1) 
We infer that the sense of word x in X is si if 
Hi're(X, };i) is maximnm alnong t'~ ,...,}~p. 
Giw:n ml article, the procedure for WSD is ap- 
plied to each word (noun) in an article, i.e. the 
sense of each noun is estimated using formula (1) 
and the word is rel)laced 1)y its sense. Tat)le 1 
shows samI)le of the results of our disambiguation 
nn'thod. 
Tabh~ 1: The results of the WSD lnethod 
Input A munber of major aMines adopted 
continental aMines' .-. 
Output A number5 of major airlinesl 
adopted continental2 airlines2 ... _ 
In Tal)le I, underline signifies polysenmus nolln. 
'()utlmt.' shows that ea('h noun is rel)laced l)y a 
syml)ol word which corresl)onds to each sense of 
a word. We call 'Inlmt' and ~()utput' in Table 1, 
mt (rriginal artMe and a new artMe, respectively. 
Tabh, 2: The definition of 'nnntber' 
~W\].: 
mmdwr2: 
Iltllllber3; 
nltHl|)er4: 
llllllt\])erS: 
Every mmd)er occupies a unique 
position iv a sequence. 
He was lieu (HIe of Ollr nllllll)er. 
A telel)hOnC numl)er. 
~h(" was nuInber seVelt ill tit(, ra(,c. 
A large nmnber of people. 
Table 2 shows the definition of hmml)er' in the 
Collins English, Dictionary. 'numl)erl' ~ 'nunt- 
l)er5' are symbol words and show different senses 
Of 'llunlber'. 
3.2 Linking Nouns with their 
Semantically Similar Nouns 
Our method for classification of articles uses the 
results of dismnbiguation method. The problems 
here are: 
1. The frequency of ewwy disambiguated noun 
in new articles is lower than that of every pol- 
ysemous noun in oriqinal articles. For exaln- 
ple, the frequency of 'nulnber5' in Table 1 is 
lower than that of 'number 't. Furthermore, 
some nouns in articles may be semantically 
similar with each other. For example, 'num- 
ber5' in Table 2 and 'sum4' in Table 3 arc 
ahnost the saine sense. 
2. A phr~sal lexicon which Walker suggested in 
his method gives a negatiw~ influence for clas- 
sification. 
1If all 'mlmber' are used ~s ~nunJ)er5' sense, the 
flequency of 'number' is the same as 'numl)erS'. 
407 
Table 3: The definition of 'sum' in the dictionary 
stun1: 
Slllll2: 
Slllli3: 
sunl4: 
sunlS: 
The result of the addition of num- ~? 
l'S, 
lie or inore eohntlns or rows of 
numbers to be added. The limit of the first n terms of a 
converging infinite series as ,~ tends 
to infinity. 
He borrows ellorlnoltS sluns. 
The essence or gist of a matter. 
Table 4: Pairs of nouns with Dis(vl,v2) wflues 
BBK 
0.125 share1 company1 
0.140 giorgio di 
0.215 shares2 share2 
0.262 share2 corot)any1 
0.345 new3 yorkl 
In order to cope with these prol)lems, we linked 
nouns in new articles with their semantically sim- 
ilar nouns. The procedm'es for linking are the fol- 
lowing five stages. 
Stage One: Calculating Mu 
The first stage for linking nouns with their se- 
manticMly sinfilar nouns is to calculate Mu be- 
tween noun pMr x and y in new articles. In order 
to get a reliable statistical data, we merged every 
new article into one and used it to calculate Mu. 
The results are used in the following stages. 
Stage Two: Representing every noun as a vector 
The goal of this stage is to rel)resent every noun 
in a new article as a vector. Using ~t term weight- 
ing method, nouns iI| a new article would be rep- 
resented by vector of the form 
v = (2) 
where wl is the element of a new arti(:le and cor- 
responds to the weight of the noun wl. In our 
method, the weight of wi is the wdue of Mu be- 
tween v and wi which is calculated in Stage One. 
Stage Three: Measuring similarity between 
vectors 
Given a vector representation of nouns in new 
articles ~s ill forlnula (2), a dissimilarity between 
two words (noun) vl, v2 in an article would be 
obtMned by using formula (3). A dissimilarity 
measure is the degree of deviation of the grout) 
in an n-dimensionM Euclidean space, where 'n is 
the number of nouns which co-occur with t~ 1 and 
'U 2 . 
Dis(vl,v2) = E~=, Ej'~=,(vlj - ffj)'2 (a) 
~0 = (fh,"',.q;,) is the centre of gravity and I .q I 
is the length of it. A group with a smMler value 
of (3) is considered semantically less deviant. 
Stage Four: Clustering method 
For a set of nouns Wl~ 'W2~ "'', w,~ of a new 
article, we calculate the semantic devi~ttion value 
of all possible pairs of nouns. 
Table 4 shows sample of the results of nouns 
with their semantic deviation values. 
Iu Table 4, 'BBK' shows the topic of the arti- 
cle which is tagging in the WSJ, i.e. 'Buybacks'. 
The value of Table 4 shows the semantic deviation 
vMue of two nollnS 2. 
The. clustering algorithm is applied to the sets 
shown in Table 4 and produced a set of semantic 
clusters, which are. ordered in the as('ending order 
of their semantic deviation wdues. We adopted 
non-overlal)ping , group average method in our 
clustering technique (Jardine, 1991). The sample 
results of clustering is shown in Table 5. 
Table 5: Chtstering results of 'BBK' 
0.125 \[share1 company1\] 
0.140 \[giorgio di\] 
0.215 \[shares2 shm'e2\] 
0.251 \[sharel comp~myl stmre2 shares2\] 
The wdue of TaMe 5 shows the selnantie deviation 
value of the cluster. 
Stage Five: Linking nouns with their semanti- 
cally similar nouns 
We selected different 49 artMes from 1988, 1989 
WSJ, and applied to Stage One ~ Four. From 
these results, we manuMly selected (:lusters which 
are judged to be semantically similar. For the se- 
lected chtsters, if there is a noun which belongs 
to several clusters, these clusters are grouped to- 
gether. As a result, each cluster is added to a 
sequential number. The sample of the results are 
shown in Tal)le 6. 
Table 6: The results of Stage Five 
Se(l. nlllil 
wordl : 
"lUol'd2 : 
Iuol~d3 : 
word4 : 
word5 : 
SemanticMly similar nouns 
bank3, banks3 
emiada3, emmda4 
Amerieanl, expressl 
co., corp., eompanyl • • • 
August, June, July, Sept. Oct. -.. 
new2 york2 
eIn Table 4, there m'c some nouns which are not 
added to the number, '1' ~,, '5', e.g. 'giorgio', 'di'. 
This shows that for these words, there is only one 
meaning in the dictionary. 
408 
'Seq. hum' in Table 6 shows a sequential numt)er, 
'wordl', ...,'word,,' whi(:h are added to the grou 1) 
of semantically similar nouns 3. Tal)le 6 shows, 
for examl)le , 'new2' and 'york2' are xemanti('ally 
similar and form a phn~sal lexicon. 
3.3 Clustering of Articles 
According to Table 6, freqllen('y of every word in 
new artMes ix counted, i.e. if a word in a ne, w 
article t)ehmgs to the gron l) shown ill Tal)h' 6, 
the word is rel)laced t)y its rel)resentative mmfl)('r 
'wordi' and th(' fre(luency of 'word/' is count('d. 
For (,xalnlJe, 'l)ank3' and 'banks3' in a new ~Lrti- 
cle are rel)laced by 'wordi', aud the frequen('y of 
'wordi' equals to the total nulnl)er of fr('quency of 
'bank3' and q)anks3'. 
Using a term weighting method, articles wouhl 
be represented 1)y vectors of the form 
A = (wt,w.,'",w.) (4) 
where Wi (:orresl)on(ls to the weight of the noun 
i. The weight is used to the fr('(lu(mcy of noun. 
Given the vector rel)resentations of articles as in 
formula (4), a similarity between Ai and Aj are 
caJculated using formula (1). The greater the 
wtlue of Sim(Ai, Aj) is, the ntore xinfilar these 
two articles are. The ('lustering Mgorithm whh:h 
is described in Stage Four is appticd to each 1)ah ' of 
articles, and t)roduces a set of ('lusters whh'h are 
ordered in the des(:ending order of ~heir semantic 
similarity wdues. 
4 Experiments 
We have conducted flmr ('xl)eriments, i.e. q!'req', 
'Dis', 'Link', and 'Method' in order to exanline 
how WSD me, thod and linking words with their se- 
mantically similar words(linking method in short) 
atfect the clustering results. 'Fl'eq' is fl'equency- 
t)a~sed exlmriment, i.e. we use word frequency for 
weighting and do not use WSD and linking meth- 
ods. 'Dis' is con(:erned with disambiguationd)ased 
experim(mt, i.e. the (:lustering algorithm is ap- 
plied to new artMes. 'lAnk' ix con/:erned with 
linking-l)~used experiment, i.e. we applied linking 
method to original artMes. 'Method' shows our 
proposed method. 
4.1 Data 
The training tort)us we have used ix the 1988, 
1!)89 WSJ ill ACL/DCI CD-I{OM whi(.h ('onsists 
of al)out 280,000 1)art-of-spee('h tagged sentences 
(Brill, 1992). From this eorlmx, we seh,cted at 
random 49 (lifferent articles for test data, each 
of which (-onsixts of 3,500 sentences and has dif- 
ferent tel)it ilallle wlfich is tagging in the WS,I. 
We classified 49 artMes into eight categories, e,g. 
Sin our experiments, m equals to 238. 
'market news', 'food. restaurant', etc. The di('tio- 
nary we have used is Collins English Dictionary 
in ACL/DCI CD-ROM. 
in WSD nwthod, the (:o-occurrence of x and y 
f'or cah:ulating Mu is that the two words (x,y) al)- 
pear in the training (:orl)uS in this order in a win- 
dow of 100 words, i.e. a: is folh)wed by y within 
a 100-word distance. This is because, the larger 
win(h)w sizes might be ('onsidered to be useful for 
extra('ting s(unanti(' relationshil)s between ltOltllS. 
Basic words are sele('te(l the lO00th most fre(luent 
words in the reference Collins English, Dictionary. 
'\['he length of a selltetl(:(, ~" which contains a 1)ol - 
yxemons n(mn and the h'ngth of a sentence of 
di('tionary-defilfition are maximuln 20 words. For 
ea('h t)olysemous nmm, we selected the tirst top 5 
definitions in the (lictionary. 
In linking m(,thod, a window size of the c(> 
o('('urren('e of .c and y for ('ah'ulating Mu is the 
same as that in WSD method, i.e. ~L window of 100 
words. W(' xeh,cted 969 ~ 9128 different (noun, 
nomt) pairs for each article, 377 ~ 1259 tilt%r- 
ell| llOllllS Oil condition that frequ(,ncies and Mu 
~,,,..or 1,,w (f(.,, :j) _> 5, M,,(., v) _> a) t,, per- 
mit a relial)le statistical analysis 4. As a result of 
Stage Four, we nlanually selected (:lusters wlfich 
are judged to 1)e semanti(:ally similar. As a result, 
w(' sele('te(l clusters on (:ondition that the thresh- 
old value for similarity wax 0.475. For the seh'cted 
('lusters, if there ix a noun which belongs to xev- 
('ral ehtsters, thex(, chlsters are grouped together. 
As a r(,sult, we obtahwd 238 clusters in all. 
4.2 Resnlts of the experiments 
The results are showit in TM)le 7. 
'\[h\])ie 7: The results of the experiments 
Ar|;i('h, Num Freq Link Dis Method 
5 10 4 4 5 8 
\[0 10 4 6 6 9 
15 10 7 7 7 8 
20 l.O 6 6 6 6 
Total 40 21 23 24 31 
(%) (-) (52.5) (57.5) (60.0) (77.5) 
In Tal)le 7, 'Article' means the munber of articles 
which are sele(:ted from test data. ~Nltnl' iiIPalls 
the, nunlber for each 'Article', i.e. we selected 1(I 
sets for each 'Article'. 'Freq', 'Link', 'Dis', and 
<Method' show the nulnlmr of sets which are clus- 
tered (:orrectly in ea(:h experiment. 
The samph' results of 'Article = 20' fl)r each 
(,xperiment is shown in Figure 1, 2, 3, and 4. 
In Figure 1, 2, 3, mid 4, the X-axis is the sim- 
ilarity wdue. A1)l)reviation words in each Figure 
and categories are shown in Talile 8. 
4 Her(', f(x, y) is the munl)er of total co-occurrences 
of words :e and y in this order in st window size of 100 
words. 
409 
0.9 0.7 0 5 X:slmllarity value 
I I 0.638 I *X 
\['-- BBK ~_~420 |TNM 
market~ STK 2'~~1 
news L~ 260 
16 REC 0 .206 
metal -~CS 
retailing- RET ~ 
\[ood r- RFD O .141 restauranU- FOD 
ARO 
e em ca t_MT C 
Figure 1: Tile results of 'Freq' experiment 
news t._ DIV ~ \[ 0.890 RET 
CMD .-~981~843 a762 ARC) _:=:_==_.a 
TNM 0.956 
~C S ~382 
BVG _____1 ~_1~871 
CEO PRO ~_ ~.841 0,651 
FOD ~ ~'814 ~204 
ENV ' I \[-'-- HEA I I 
I MTC 
STK BON 
Figure 2: The results of 'Link' experiment 
0.7 0.5 0.3 I I --I *,X 
.... 0,679 V ~~.53o 
marketlT~N'~l~---I ~413 news / DI~_____ ~ ~_ 0.263 
metal- PCS ~ I 
HI 0 55 -- 0 263 \[ 
restaurant\] REC~ ~ ~\] I FOP -- , \[ \[I 
t.. RFD ' I I BON- 0,1'~ I 
environment- ENV 0~7N k 
science- ARO ~ \[_J \[ 
farm~CMD 0~35 ~172 \[ chemical\[- HEA- ,~,oo -,--.~j 
t-MTC- J 0.073 
Figure 3: The results of 'Dis' experiment 
5 Discussion 
1. WSD method 
According to Table 7, there, are 24 sets which 
could be (:lustered correctly in 'Dis', while 21 sets 
in 'Freq'. Examining the results shown in Fig- 
ure 3, 'BVG' and 'HRD' are correctly classified 
into 'food • restaurant' and 'market news', respec- 
tively. However, the results of 'Freq' (Figure 1) 
shows that they are classified incorre<'tly. Table 
0.9 0T7 r~l~h~JL'9~4969 015 • X 
BB 1- \[ DIV J ~.923 
1 STK -~ ~L922 
market I TN M 0~72~L~913 
news I ~\[ L863 
I ~~ tAs2s t-HRD~ \] 10.819 
science- ARO -~ 
metal-- PCS 0.893' L, \[- B VG ~58~-756 
I FOD ~ I tl food I PRO ~\] I I.-'" 
restaurant\[ hE~- ~7~J~ "a" 
t._ RF~I) retailing-- RRF!l ),' ~ ~0:845l I 0"5~73, 
environment-- ENV 
ehemical\[-MT C 0 
farm-~M~- 
Figure 4: The results of 'Method' experiment 
Table 8: Topic: and category name 
Category 
market 
news 
science 
metal 
food 
restaurant 
Topic 
BBK: Buybacks 
BON: Bond Market News 
CEO: Dow Jones interview 
DIV: dividends 
ERN: em'nings 
HI/D: Hem'd on the street 
STK: stock mm'ket 
TNM: tender offers 
ARO: aerospace 
PCS: precious metals, stones, gold 
BVG: beverages 
FOD: food products 
PRO: corporate profile 
REC: recreation, entertainment 
RFD: restaurant, supermarket 
retailing RET: retailing 
environment 
chemical 
farm 
ENV: environment 
HEA: health care providers, medicine 
MTC: medicM and biotechnology 
CMD: commodity news, farm products 
9 shows different senses of word ill 'BVG', and 
'HRD' which could be discriminated in 'Dis'. 
In Table 9, for example, 'security' is high freqtlen- 
ties and used ill 'being secure' sense ill 'BVG' ar- 
tMe, while 'security' is 'certificate of creditorshiI)' 
sense in 'HRD'. One possible cause that the re- 
sults of 'Freq' is worse than 'Dis' is that these 
polyselnous words which are high-frequencies are 
not recognised polysemy in 'Freq'. 
2. Linking method 
As shown in Table 7, there are 23 sets which 
could be clustered correctly in 'Link', while 21 sets 
ill 'Freq'. For example, 'ERN' and 'HRD' are both 
concerned with 'market news'. In Figure 2, they 
are clustered with high similarity wflue(0.943), 
while in Figure 1, they are not(0.260). 
Exalnilfing the results, there are 811 nouns 
in 'ERN' article, and 714 nouns in 'HRD', and 
410 
Table 9: Different word-senses in BVG and HRD 
security 
rate 
sale 
stock 
BVG 
the state of 
being secure 
a quantity in relation 
the exchange of goods 
total goods 
HRD 
certificate of 
creditorship 
l)rice of charge 
the alllOllltt of sold 
stock market 
of these, 'shares', 'stock', and 'share' which are 
semantically similar ~re included. In linking 
method, there are 251 nmmn in 'ERN' and 492 
nouns in 'HRD' whi('h ~tre repl~tccd for represen- 
tative words. However, in 'Freq', each noun cor- 
responds different coordinate, and regards to dif- 
ferent meaning. As a result, these tol)ics are clus- 
tered with low similarity wdue. 
3. Our method 
Tit('. results of 'Method' show tha,t 31 out of 40 
sets are cbLssified correctly, att(I the per('entage at- 
tained was 77.5%, while 'Freq', 'Link', and 'Din' 
ext)eriment att,~tined 52.5%, 57.5%, 6().0%, renl)e(:- 
tively. This shows the effe(-tivelmss of our method. 
In Figure 4, the ~u'ticles ,tre judged to ('l,tssify 
into eight categories. Examining 'ERN', 'CEO' 
and 'CMD' in Figure 1, 'CE()' and 'CM1)' are 
grouped together, while they have (lifferent c~,t- 
egories with each other. On the other hand, 
in Figure 3, 'ERN' and 'CE()' ar(, groul)ed to- 
gether corre('tly. Examining the nouns which arc 
1)elonging to 'ERN' mid 'CE()', 'plant'(factory 
and food senses), 'oil'(petrohmnl and food), 'or- 
der'(colmn~nd ;md dema.nd), and 'interent'(del)t 
and curiosity) whi(:h are high frequencies ~re cor- 
rectly dismnbiguated. Furthermore, in Figure 4, 
'ERN' mM 'CEO' are classified into 'market news', 
and 'CMD' are cb~ssilied into 'fm:m', correctly. For 
example, 'plant' which is used in 'factory' sense in 
linked with semanti('~lly silnib~r words, 'ntanuf;w- 
turing', 'factory', 'production', or 'job' et('.. In a 
simibtr way, 'i)bmt' which in uned in flood' sense 
is linked with 'environmeltt', 'forest'. As a result, 
the articles are classified correctly. 
As shown in Table 7, there arc 9 nets which 
could not 1)e clustered correctly in our method. A 
possible improwmmnt is that we use all definitions 
of words in the dictionary. We s(qeeted the first 
top 5 definitions in the dictionary for each noun 
and used theln in the cxperilnent. However, there 
are some words of which the memfings are not in- 
cluded these selected definitions. Thin (:~mses the 
fact theft it is hard to get a higher percentage of 
correct clustering. Another interesting t)ossibil - 
ity in to use ml altermttive weighting policy, such 
a,s the widf ( weigh.te, d invcr.sc docwmcnt fre, qucncy) 
(Tokunaga, 1994). The widf is reported to have a 
marked ~ulwmtage over the idf ( invers~ ~. document 
frequency) for the text categoris~Ltion tank. 
6 Conclusion 
\Ve have rei)orted an exl)erimentad study for clus- 
tering of ~rticles by using on-line (lictiom~ry deft- 
nitions mid showed how dictionary-definitiolt cam 
use effectively to classify articles, ea('h of which be- 
longs to the restricted sul)ject domain. In order 
to Col)e with the relnainiug i)rol)lems inentioned 
in section 5 and apply thin work to practical use, 
we will conduct further e×perilnents. 
References 
P. F. Brown et al., 1991. V~ror(1-Sense Disambiguation 
Using Statisti('al Methods. In Proc. of the 291h An- 
w~tal Meeting of the A CL, pl ). 264-270. 
E. Brill, 1992. A siml)le rule-I)~Lsed 1)art of speech tag- 
ger. In Proc. of th, c 3nd conference on applied natu- 
ral language procc,ssiug, ACL, pp. 152-155. Trcnto, 
Italy, 1992. 
K. W. Church (,t al., 1991. Using Statistics in Lexi- 
eal Analysis, Lexical acquisition: Exploiting on-line 
re.qource~ to build a lea:icon. (Zernik Uri (ed.)), pp. 
115-164, London, Lawrence Erlbaum Associates. 
L. Guthric and E. Walker, "DOCUMENT CLASSI- 
FICATION BY MACHINE: Theory and Practice", 
h, Proc. of the 15th, International Confercncc on 
Computational Linguistics, Kyot% Japan, 1994, l)P. 
1059-1063 
N..lardine and 11. Sibson, 1968. The construction 
of hierarchic and non-hierarchic classifications. In 
Computer ,lourrtal, i)p. 177-1.84. 
K. S. Jones, 1973. A statistical interl)retation of term 
sp(~cificity and its apl)lieation in retrieval. Journal 
of Documentation, 28 (1973) 1, pp. 11-21. 
M. IAberman, editor. 1991. CD-ROM I, Association 
for Comlmtational Linguistics Data Collection Ini- 
tiative, University of Pennsylvania. 
Y. Niwa and Y. Nitta, 1995. Statistical Word Sense 
Disalnbiguation Using Dictionary Definitions In 
Proc. of the Natural Language Processing Pacific 
Rim Sympoaium '95, Seoul, Korea, pp. 665-670. 
G. Salton and M. a. M('Gill, 1983. Introduction to 
Modern hfformation Retrieval. McGraw-Hill, 1983. 
T. Tokunaga and M. Iwayalna, 1994. Text Categori- 
s~fl;ioll based on Weighted Inverse Doenment FI'e- 
quency IPS.\] SIG l~.cl)orts, 94-NL-1f)0, 1994. 
I). Yarowsky, "Word sense (lismnbiguation using sta- 
tistical models of I/oget's categories trained on large 
corl)ora" , In Proc. of the 14th International Confc> 
e'ncc on Computational Linguistics, Nantes, France, 
1992, l)P. 454-46{) 
N. Yuasa ('t al., 1995. Cb~ssifying ArtMes Using Lex- 
ical Co-occurrence in Large Document Databmses 
In TTu'na. of Infl~rmation Processing Society Japan, 
pp. 1819-1827, 36 (1995) 8. 
1), Walker and I/. Amsler, 1986. The Use of Machine- 
l-leadable Dietionm:ies in Sublanguage m,Mysis, An- 
alyzing Language in Restricted domains, (Grishman 
aim Kittredge (cd.)), pp. 69-84, Lawrence Erlbaum, 
ltillsdale, NJ. 11987) 2. 
411 
