CO-OCCURRENCE VECTORS FROM CORPORA VS. 
DISTANCE VECTORS FROM DICTIONARIES 
Yoshiki Niwa and Yoshihiko Nitta
Advanced Research Laboratory, Hitachi, Ltd.
Hatoyama, Saitama 350-03, Japan
{niwa2, nitta}@harl.hitachi.co.jp
Abstract 
A comparison was made of vectors derived by using
ordinary co-occurrence statistics from large text corpora
and of vectors derived by measuring the inter-word
distances in dictionary definitions. The precision
of word sense disambiguation by using co-occurrence
vectors from the 1987 Wall Street Journal (20M total
words) was higher than that by using distance vectors
from the Collins English Dictionary (60K head words
+ 1.6M definition words). However, other experimental
results suggest that distance vectors contain some
different semantic information from co-occurrence
vectors.
1 Introduction 
Word vectors reflecting word meanings are expected to
enable numerical approaches to semantics. Some early
attempts at vector representation in psycholinguistics
were the semantic differential approach (Osgood et
al. 1957) and the associative distribution approach
(Deese 1962). However, they were derived manually
through psychological experiments. An early attempt
at automation was made by Wilks et al. (1990) using
co-occurrence statistics. Since then, there have
been some promising results from using co-occurrence
vectors, such as word sense disambiguation (Schütze
1993), and word clustering (Pereira et al. 1993).
However, using the co-occurrence statistics requires
a huge corpus that covers even most rare words.
We recently developed word vectors that are derived
from an ordinary dictionary by measuring the inter-word
distances in the word definitions (Niwa and Nitta
1993). This method, by its nature, has no problem
handling rare words. In this paper we examine the
usefulness of these distance vectors as semantic
representations by comparing them with co-occurrence
vectors.
2 Distance Vectors 
A reference network of the words in a dictionary (Fig.
1) is used to measure the distance between words. The
network is a graph that shows which words are used in
the definition of each word (Nitta 1988). The network
shown in Fig. 1 is for a very small portion of the
reference network for the Collins English Dictionary (1979
edition) in the CD-ROM I (Liberman 1991), with 60K
head words + 1.6M definition words.
[Figure 1: graph of a reference network. Nodes shown: writing, unit (O1),
word, communication, alphabetical, language, dictionary, people (O3),
book (O2).]
Fig. 1. Portion of a reference network.
For example, the definition for dictionary is "a
book in which the words of a language are listed
alphabetically ...." The word dictionary is thus linked
to the words book, word, language, and alphabetical.
A word vector is defined as the list of distances
from a word to a certain set of selected words, which
we call origins. The words in Fig. 1 marked with
Oi (unit, book, and people) are assumed to be origin
words. In principle, origin words can be freely chosen.
In our experiments we used middle frequency words:
the 51st to 1050th most frequent words in the reference
Collins English Dictionary (CED).
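The distance vector for dictionary is derived as follows:

    $$ \text{dictionary} \;\Rightarrow\;
       \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix}
       \begin{matrix}
         \cdots\ \mathrm{distance}(\text{dict.}, O_1) \\
         \cdots\ \mathrm{distance}(\text{dict.}, O_2) \\
         \cdots\ \mathrm{distance}(\text{dict.}, O_3)
       \end{matrix} $$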
The i-th element is the distance (the length of the
shortest path) between dictionary and the i-th origin,
Oi. To begin, we assume every link has a constant
length of 1. The actual definition for link length will
be given later.
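As an illustration, the constant-link-length case can be sketched in Python with a breadth-first search. The edge list below is only a guess at part of Fig. 1, the network is treated as undirected, and all function names are hypothetical; none of this is taken from the original implementation.

```python
from collections import deque

# Assumed toy edges, loosely following Fig. 1 (not the real CED network).
EDGES = [
    ("dictionary", "book"),
    ("dictionary", "word"),
    ("dictionary", "language"),
    ("dictionary", "alphabetical"),
    ("word", "unit"),
    ("word", "writing"),
    ("language", "people"),
    ("language", "communication"),
]


def build_graph(edges):
    """Build an undirected adjacency map (undirectedness is an assumption)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distances_from(graph, source):
    """Breadth-first search: shortest path lengths when every link has length 1."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def distance_vector(graph, word, origins):
    """List of distances from `word` to each origin, in origin order."""
    dist = distances_from(graph, word)
    # Unreachable origins become None; the paper does not say how they are handled.
    return [dist.get(origin) for origin in origins]


GRAPH = build_graph(EDGES)
ORIGINS = ["unit", "book", "people"]
print(distance_vector(GRAPH, "dictionary", ORIGINS))  # [2, 1, 2]
```

With frequency-dependent link lengths, the breadth-first search would have to be replaced by a weighted shortest-path search such as Dijkstra's algorithm.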
If word A is used in the definition of word B, these
words are expected to be strongly related. This is the
basis of our hypothesis that the distances in the reference
network reflect the associative distances between
words (Nitta 1993).
Use of Reference Networks. Reference networks
have been successfully used as neural networks
(by Véronis and Ide (1990) for word sense disambiguation)
and as fields for artificial association, such as
spreading activation (by Kozima and Furugori (1993)
for context-coherence measurement). The distance
vector of a word can be considered to be a list of the
activation strengths at the origin nodes when the word
node is activated. Therefore, distance vectors can be
expected to convey almost the same information as
the entire network, and clearly they are much easier
to handle.
Dependence on Dictionaries. As a semantic
representation of words, distance vectors are expected
to depend very weakly on the particular source
dictionary. We compared two sets of distance vectors,
one from LDOCE (Procter 1978) and the other from
COBUILD (Sinclair 1987), and verified that their
difference is at least smaller than the difference of the
word definitions themselves (Niwa and Nitta 1993).
We will now describe some technical details about
the derivation of distance vectors.
Link Length. Distance measurement in a reference
network depends on the definition of link length.
Previously, we assumed for simplicity that every link
has a constant length. However, this simple definition
seems unnatural because it does not reflect word
frequency. Because a path through low-frequency words
(rare words) implies a strong relation, it should be
measured as a shorter path. Therefore, we use the
following definition of link length, which takes account
of word frequency.
    $$ \mathrm{length}(W_1, W_2) \stackrel{\mathrm{def}}{=} -\log\left(\frac{n^2}{N_1 \cdot N_2}\right) $$
This shows the length of the links between words
Wi (i = 1, 2) in Fig. 2, where Ni denotes the total number
of links from and to Wi and n denotes the number
of direct links between these two words.
Fig. 2 Links between two words.
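The effect of this definition can be checked with a small Python sketch. It assumes the formula is read as -log(n^2 / (N1 * N2)) and uses the natural logarithm; the base is not stated in the text, and the function name is hypothetical.

```python
import math


def link_length(n, n1, n2):
    """Link length between two words, read here as -log(n^2 / (N1 * N2)).

    n  : number of direct links between the two words
    n1 : total number of links from and to the first word
    n2 : total number of links from and to the second word
    The natural logarithm is an assumption; the paper does not give the base.
    """
    if n <= 0 or n1 <= 0 or n2 <= 0:
        raise ValueError("link counts must be positive")
    return -math.log(n * n / (n1 * n2))


# A link between two rare words is shorter than a link between two frequent ones.
rare = link_length(1, 4, 5)
frequent = link_length(1, 400, 500)
print(rare < frequent)  # True
```

Under this reading, a pair of words whose only links are to each other (n = N1 = N2) gets length 0, the shortest possible link.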
Normalization. Distance vectors are normalized
by first changing each coordinate into its deviation
in the coordinate:

    $$ v = (v_i) \;\rightarrow\; v' = (v'_i), \qquad v'_i = \frac{v_i - a_i}{\sigma_i} $$

where $a_i$ and $\sigma_i$ are the average and the standard
deviation of the distances from the i-th origin. Next, each
coordinate is changed into its deviation in the vector:

    $$ v' = (v'_i) \;\rightarrow\; v'' = (v''_i), \qquad v''_i = \frac{v'_i - \bar{v'}}{\sigma'} $$

where $\bar{v'}$ and $\sigma'$ are the average and the standard
deviation of $v'_i$ (i = 1, ...).
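The two normalization steps can be sketched as follows. This is a minimal illustration on made-up three-dimensional vectors; the use of the population standard deviation and the fallback of 1.0 for a zero deviation are assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    m = mean(values)
    s = pstdev(values) or 1.0  # assumed guard against a zero deviation
    return [(x - m) / s for x in values]


def normalize(vectors):
    """vectors: dict mapping each word to its list of distances to the origins."""
    words = list(vectors)
    dims = len(vectors[words[0]])
    # Step 1: deviation in the coordinate (per origin, across all words).
    columns = [standardize([vectors[w][i] for w in words]) for i in range(dims)]
    step1 = {w: [columns[i][k] for i in range(dims)] for k, w in enumerate(words)}
    # Step 2: deviation in the vector (per word, across all origins).
    return {w: standardize(v) for w, v in step1.items()}


RAW = {
    "a": [1.0, 2.0, 4.0],
    "b": [3.0, 2.5, 1.0],
    "c": [2.0, 5.0, 3.0],
}
for word, vec in normalize(RAW).items():
    print(word, [round(x, 3) for x in vec])
```

After the second step every vector has mean 0 and standard deviation 1 over its own coordinates, so vectors of different words are directly comparable.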
3 Co-occurrence Vectors
We use ordinary co-occurrence statistics and measure
the co-occurrence likelihood between two words, X
and Y, by the mutual information estimate (Church
and Hanks 1989):

    $$ I(X, Y) = \log^{+} \frac{P(X \mid Y)}{P(X)}, $$
where P(X) is the occurrence density of word X in the
whole corpus, and the conditional probability P(X|Y)
is the density of X in a neighborhood of word Y. Here
the neighborhood is defined as 50 words before or after
any appearance of word Y. (There is a variety of
neighborhood definitions, such as "100 surrounding words"
(Yarowsky 1992) and "within a distance of no more
than 3 words ignoring function words" (Dagan et al.
1993).)
The logarithm with '+' is defined to be 0 for an
argument less than 1. Negative estimates were neglected
because they are mostly accidental except when X and
Y are frequent enough (Church and Hanks 1989).
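A minimal Python sketch of this estimate on a toy token list is given below. The base-2 logarithm, the treatment of overlapping neighborhoods as a union of positions, and the function names are assumptions made for illustration; the return value of 0 for unmeasurable pairs follows the convention stated later in this section.

```python
import math


def log_plus(x):
    """Logarithm with '+': 0 for an argument less than 1 (base 2 is assumed)."""
    return math.log2(x) if x >= 1 else 0.0


def cooccurrence_likelihood(tokens, x, y, window=50):
    """Estimate I(X, Y) = log+ (P(X|Y) / P(X)) from a list of tokens."""
    total = len(tokens)
    count_x = tokens.count(x)
    # Positions within `window` tokens of any occurrence of y (y itself excluded).
    neighbourhood = set()
    for pos, tok in enumerate(tokens):
        if tok == y:
            lo, hi = max(0, pos - window), min(total, pos + window + 1)
            neighbourhood.update(p for p in range(lo, hi) if p != pos)
    if count_x == 0 or not neighbourhood:
        return 0.0  # cannot be measured, so the value is set to 0
    p_x = count_x / total
    p_x_given_y = sum(1 for p in neighbourhood if tokens[p] == x) / len(neighbourhood)
    return log_plus(p_x_given_y / p_x)


TOKENS = ("the bank raised the interest rate while "
          "the river bank flooded the old town").split()
print(cooccurrence_likelihood(TOKENS, "rate", "interest", window=2))
print(cooccurrence_likelihood(TOKENS, "town", "interest", window=2))  # 0.0
```

In the first call, rate fills 1 of the 4 neighborhood positions of interest but only 1 of the 14 corpus positions, so the ratio is 3.5 and the estimate is log2(3.5).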
A co-occurrence vector of a word is defined as the
list of co-occurrence likelihoods of the word with a
certain set of origin words. We used the same set of origin
words as for the distance vectors.

    $$ CV[w] = \begin{pmatrix} I(w, O_1) \\ I(w, O_2) \\ \vdots \\ I(w, O_m) \end{pmatrix} $$

Co-occurrence Vector.
When the frequency of X or Y is zero, we cannot
measure their co-occurrence likelihood, and such cases
are not exceptional. This sparseness problem is well
known and serious in co-occurrence statistics. We
used as a corpus the 1987 Wall Street Journal in the
CD-ROM I (1991), which has a total of 20M words.
The number of words which appeared at least once
was about 50% of the total 62K head words of CED,
and the percentage of the word-origin pairs which
appeared at least once was about 16% of the total 62K ×
1K (= 62M) pairs. When the co-occurrence likelihood
cannot be measured, the value I(X, Y) was set to 0.
4 Experimental Results
We compared the two vector representations by using
them for the following two semantic tasks. The first is
word sense disambiguation (WSD) based on the similarity
of context vectors; the second is the learning of
positive or negative meanings from example words.
With WSD, the precision by using co-occurrence
vectors from a 20M-word corpus was higher than by
using distance vectors from the CED.
4.1 Word Sense Disambiguation 
Word sense disambiguation is a serious semantic problem.
A variety of approaches have been proposed for
solving it. For example, Véronis and Ide (1990) used
reference networks as neural networks, Hearst (1991)
used (shallow) syntactic similarity between contexts,
Cowie et al. (1992) used simulated annealing for quick
parallel disambiguation, and Yarowsky (1992) used
co-occurrence statistics between words and thesaurus
categories.
Our disambiguation method is based on the similarity
of context vectors, which was originated by
Wilks et al. (1990). In this method, a context vector
is the sum of its constituent word vectors (except
the target word itself). That is, the context vector for
context

    $$ C:\ \ldots\, w_{-N} \ldots w_{-1}\; w\; w_1 \ldots w_N\, \ldots $$

is

    $$ V(C) = \sum_{\substack{i=-N \\ i \neq 0}}^{N} V(w_i). $$
The similarity of contexts is measured by the angle
of their vectors (or actually the inner product of their
normalized vectors).

    $$ \mathrm{sim}(C_1, C_2) = \frac{V(C_1) \cdot V(C_2)}{|V(C_1)|\,|V(C_2)|} $$
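The context vector and the similarity measure can be sketched in Python as follows. The two-dimensional word vectors are invented for illustration, and skipping words that have no vector is an assumption; the function names are hypothetical.

```python
import math


def context_vector(tokens, target_index, word_vectors, n):
    """Sum the vectors of the words within n tokens of the target, target excluded."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    lo = max(0, target_index - n)
    hi = min(len(tokens), target_index + n + 1)
    for i in range(lo, hi):
        if i == target_index:
            continue  # the target word itself is left out
        vec = word_vectors.get(tokens[i])
        if vec is not None:  # words without a vector are skipped (assumption)
            total = [t + v for t, v in zip(total, vec)]
    return total


def sim(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


WORD_VECTORS = {"pay": [1.0, 0.0], "loan": [0.5, 0.5]}  # made-up toy vectors
TOKENS = ["pay", "interest", "on", "loan"]
context = context_vector(TOKENS, 1, WORD_VECTORS, n=10)
print(context)                    # [1.5, 0.5]
print(sim(context, [3.0, 1.0]))   # parallel vectors, so close to 1.0
```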
Let word w have senses s_1, s_2, ..., s_m, and let each sense
have the following context examples.

    Sense   Context Examples
    s_1     C_11, C_12, ..., C_1n_1
    s_2     C_21, C_22, ..., C_2n_2
    :
    s_m     C_m1, C_m2, ..., C_mn_m
We infer that the sense of word w in an arbitrary
context C is s_i if for some j the similarity, sim(C, C_ij),
is maximum among all the context examples.
Another possible way to infer the sense is to choose
sense s_i such that the average of sim(C, C_ij) over
j = 1, 2, ..., n_i is maximum. We selected the first
method because a peculiarly similar example is more
important than the average similarity.
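The difference between the two selection rules can be seen on a small invented example. The vectors, sense labels, and function names below are hypothetical; the similarity is the cosine used for contexts.

```python
import math


def sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_example_sense(context, examples):
    """First method: the sense owning the single most similar example."""
    return max(examples, key=lambda s: max(sim(context, e) for e in examples[s]))


def best_average_sense(context, examples):
    """Alternative: the sense with the highest average similarity."""
    return max(
        examples,
        key=lambda s: sum(sim(context, e) for e in examples[s]) / len(examples[s]),
    )


# Sense A has one very close example and two unrelated ones;
# sense B has three moderately close examples.
EXAMPLES = {
    "A": [[1.0, 0.05], [0.0, 1.0], [0.0, 1.0]],
    "B": [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
}
CONTEXT = [1.0, 0.0]
print(nearest_example_sense(CONTEXT, EXAMPLES))  # A
print(best_average_sense(CONTEXT, EXAMPLES))     # B
```

The nearest-example rule lets one peculiarly similar context decide the sense, whereas averaging dilutes it with dissimilar examples of the same sense.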
Figure 3 shows the disambiguation
precision for 9 words. For each word, we selected two
senses, shown over each graph. These senses were chosen
because they are clearly different and we could
collect a sufficient number (more than 20) of context
examples. The names of senses were chosen from the
category names in Roget's International Thesaurus,
except organ's.
The results using distance vectors are shown by
dots (•••), and using co-occurrence vectors from the
1987 WSJ (20M words) by circles (○○○).
A context size (x-axis) of, for example, 10 means
10 words before the target word and 10 words after
the target word. We used 20 examples per sense;
they were taken from the 1988 WSJ. The test contexts
were from the 1987 WSJ. The number of test contexts
varies from word to word (100 to 1000). The precision
is the simple average of the respective precisions for
the two senses.
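The averaging can be illustrated with a short sketch. It assumes that the precision for a sense means the fraction of test contexts of that sense that were labelled correctly, which the text does not spell out; the labels and function names are invented.

```python
def per_sense_precision(gold, predicted, sense):
    """Fraction of test contexts whose gold label is `sense` that were labelled correctly."""
    hits = [p == g for g, p in zip(gold, predicted) if g == sense]
    return sum(hits) / len(hits)


def averaged_precision(gold, predicted):
    """Simple (unweighted) average of the per-sense precisions."""
    senses = sorted(set(gold))
    return sum(per_sense_precision(gold, predicted, s) for s in senses) / len(senses)


# Unbalanced toy test set: 8 contexts of one sense, 2 of the other.
GOLD = ["DEBT"] * 8 + ["CURIOSITY"] * 2
PRED = ["DEBT"] * 8 + ["DEBT", "CURIOSITY"]
print(averaged_precision(GOLD, PRED))                          # 0.75
print(sum(g == p for g, p in zip(GOLD, PRED)) / len(GOLD))     # 0.9
```

Because the number of test contexts differs between senses, the unweighted average keeps the more frequent sense from dominating the reported figure.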
The results in Fig. 3 show that the precision obtained
with co-occurrence vectors is higher than that obtained
with distance vectors in all but two cases, interest and
customs, and we have not yet found a case where the
distance vectors give higher precision. Therefore we
conclude that co-occurrence vectors are advantageous
over distance vectors for WSD based on context
similarity.
The sparseness problem for co-occurrence vectors
is not serious in this case because each context consists
of multiple words.
4.2 Learning of positive-or-negative
Another experiment using the same two vector
representations was done to measure the learning of positive
or negative meanings. Figure 4 shows the changes in
the precision (the percentage of agreement with the
authors' combined judgement). The x-axis indicates
the number of example words for each positive or
negative pair. Judgement was again done by using the
nearest example. The example and test words are
shown in Tables 1 and 2, respectively.
In this case, the distance vectors were advantageous.
The precision by using distance vectors
increased to about 80% and then leveled off, while the
precision by using co-occurrence vectors stayed around
60%. We can therefore conclude that the property
of positive-or-negative is reflected in distance vectors
more strongly than in co-occurrence vectors. The
sparseness problem is supposed to be a major factor
in this case.
[Figure 3: nine panels plotting disambiguation precision (%, y-axis, 50 to
100) against context size (x-axis, 5 to 50), one panel per word:
suit (CLOTHING / LAWSUIT), organ (BODY / MUSIC), issue (EMERGENCE / TOPIC),
tank (CONTAINER / VEHICLE), order (COMMAND / DEMAND),
address (HABITAT / SPEECH), race (CLASS / OPPOSITION),
customs (HABIT (pl.) / SERVICE), interest (CURIOSITY / DEBT).
Legend: ○ co-oc. vector, • distance vector.]
Fig. 3 Disambiguation of 9 words by using co-occurrence vectors (○○○) and by
using distance vectors (•••). (The number of examples is 10 for each sense.)
[Figure 4: precision (y-axis, 50% to 100%) against the number of example
pairs (x-axis). Legend: ○ co-oc. vector (20M), • distance vector.]
Fig. 4 Learning of positive-or-negative.
Table 1 Example pairs.
      positive     negative          positive      negative
 1    true         false        16   properly      crime
 2    new          wrong        17   succeed       die
 3    better       disease      18   worth         violent
 4    clear        angry        19   friendly      hurt
 5    pleasure     noise        20   useful        punishment
 6    correct      pain         21   success       poor
 7    pleasant     lose         22   interesting   badly
 8    suitable     destroy      23   active        fail
 9    clean        dangerous    24   polite        suffering
10    advantage    harm         25   win           enemy
11    love         kill         26   improve       rude
12    best         fear         27   favour        danger
13    successful   war          28   development   anger
14    attractive   ill          29   happy         waste
15    powerful     foolish      30   praise        doubt
Table 2 Test words.
positive (20 words)
balanced elaborate elation eligible enjoy
fluent honorary honourable hopeful hopefully
influential interested legible lustre normal
recreation replete resilient restorative sincere
negative (30 words)
confusion cuckold dally damnation dull
ferocious flaw hesitate hostage huddle
inattentive liverish lowly mock neglect
queer rape ridiculous savage scanty
sceptical schizophrenia scoff scruffy shipwreck
superstition sycophant trouble wicked worthless
4.3 Supplementary Data 
In the experiments discussed above, the corpus size for
co-occurrence vectors was set to 20M words ('87 WSJ)
and the vector dimension for both co-occurrence and
distance vectors was set to 1000. Here we show some
supplementary data that support these parameter
settings.
a. Corpus size (for co-occurrence vectors) 
Figure 5 shows the change in disambiguation
precision as the corpus size for co-occurrence statistics
increases from 200 words to 20M words. (The words
are suit, issue and race, the context size is 10, and
the number of examples per sense is 10.) These three
graphs level off after around 1M words. Therefore, a
corpus size of 20M words is not too small.
[Figure 5: disambiguation precision (y-axis, 50% to 100%) against corpus
size in words (x-axis, logarithmic scale up to 10M) for suit, issue, and
race.]
Fig. 5 Dependence of the disambiguation precision
on the corpus size for co-occurrence vectors.
context size: 10,
number of examples: 10/sense,
vector dimension: 1000.
b. Vector Dimension
Figure 6 shows the dependence of
disambiguation precision on the vector dimension for (i)
co-occurrence and (ii) distance vectors. For
co-occurrence vectors, the precision levels off near a
dimension of 100. Therefore, a dimension of 1000 is
sufficient or even redundant. However, in the distance
vectors' case, it is not clear whether the precision is
leveling off or still increasing around a dimension of 1000.
5 Conclusion 
• A comparison was made of co-occurrence vectors
from large text corpora and of distance vectors
from dictionary definitions.
• For word sense disambiguation based on
context similarity, co-occurrence vectors from
the 1987 Wall Street Journal (20M total words)
were advantageous over distance vectors from the
Collins English Dictionary (60K head words +
1.6M definition words).
• For learning positive or negative meanings from
example words, distance vectors gave remarkably
higher precision than co-occurrence vectors.
This suggests, though further investigation is
required, that distance vectors contain some
different semantic information from co-occurrence
vectors.
[Figure 6: two panels plotting disambiguation precision (y-axis, 50% to
100%) against vector dimension (x-axis, logarithmic scale, 10 to 1000) for
suit, issue, and race: (i) by co-oc. vectors, (ii) by distance vectors.]
Fig. 6 Dependence on vector dimension for (i)
co-occurrence vectors and (ii) distance vectors.
context size: 10, examples: 10/sense,
corpus size for co-oc. vectors: 20M words.
References 
Kenneth W. Church and Patrick Hanks. 1989. Word
association norms, mutual information, and lexicography.
In Proceedings of the 27th Annual Meeting
of the Association for Computational Linguistics,
pages 76-83, Vancouver, Canada.
Jim Cowie, Joe Guthrie, and Louise Guthrie. 1992.
Lexical disambiguation using simulated annealing.
In Proceedings of COLING-92, pages 359-365,
Nantes, France.
Ido Dagan, Shaul Marcus, and Shaul Markovitch.
1993. Contextual word similarity and estimation
from sparse data. In Proceedings of the 31st Annual
Meeting of the Association for Computational
Linguistics, pages 164-171, Columbus, Ohio.
James Deese. 1962. On the structure of associative
meaning. Psychological Review, 69(3):161-175.
Marti A. Hearst. 1991. Noun homograph disambiguation
using local context in large text corpora. In
Proceedings of the 7th Annual Conference of the
University of Waterloo Center for the New OED
and Text Research, pages 1-22, Oxford.
Hideki Kozima and Teiji Furugori. 1993. Similarity
between words computed by spreading activation
on an English dictionary. In Proceedings of EACL-93,
pages 232-239, Utrecht, the Netherlands.
Mark Liberman, editor. 1991. CD-ROM I. Association
for Computational Linguistics Data Collection
Initiative, University of Pennsylvania.
Yoshihiko Nitta. 1988. The referential structure of the
word definitions in ordinary dictionaries. In Proceedings
of the Workshop on the Aspects of Lexicon
for Natural Language Processing, LNL88-8,
JSSST, pages 1-21, Fukuoka University, Japan.
(in Japanese).
Yoshihiko Nitta. 1993. Referential structure - a
mechanism for giving word-definition in ordinary
lexicons. In C. Lee and B. Kang, editors, Language,
Information and Computation, pages 99-110.
Thaehaksa, Seoul.
Yoshiki Niwa and Yoshihiko Nitta. 1993. Distance
vector representation of words, derived from reference
networks in ordinary dictionaries. MCCS 93-253,
Computing Research Laboratory, New Mexico
State University, Las Cruces.
C. E. Osgood, G. J. Suci, and P. H. Tannenbaum.
1957. The Measurement of Meaning. University
of Illinois Press, Urbana.
Fernando Pereira, Naftali Tishby, and Lillian Lee.
1993. Distributional clustering of English words.
In Proceedings of the 31st Annual Meeting of the
Association for Computational Linguistics, pages
183-190, Columbus, Ohio.
Paul Procter, editor. 1978. Longman Dictionary of
Contemporary English (LDOCE). Longman, Harlow,
Essex, first edition.
Hinrich Schütze. 1993. Word space. In J. D. Cowan,
S. J. Hanson, and C. L. Giles, editors, Advances
in Neural Information Processing Systems, pages
895-902. Morgan Kaufmann, San Mateo, California.
John Sinclair, editor. 1987. Collins COBUILD English
Language Dictionary. Collins and the University
of Birmingham, London.
Jean Véronis and Nancy M. Ide. 1990. Word
sense disambiguation with very large neural networks
extracted from machine readable dictionaries.
In Proceedings of COLING-90, pages 389-394,
Helsinki.
Yorick Wilks, Dan Fass, Cheng-ming Guo, James E.
McDonald, Tony Plate, and Brian M. Slator. 1990.
Providing machine tractable dictionary tools. Machine
Translation, 5(2):99-154.
David Yarowsky. 1992. Word-sense disambiguation
using statistical models of Roget's categories
trained on large corpora. In Proceedings of
COLING-92, pages 454-460, Nantes, France.
