10 
1965 International Conference on 
Computational Linguistics 
MEASUREMENT OF SIMILARITY BETWEEN NOUNS
Kenneth E. Harper
The RAND Corporation
1700 Main Street
Santa Monica, California 90406
ABSTRACT
A study was made of the degree of similarity between
pairs of Russian nouns, as expressed by their tendency to
occur in sentences with identical words in identical
syntactic relationships. A similarity matrix was prepared
for forty nouns; for each pair of nouns the number of
shared (i) adjective dependents, (ii) noun dependents, and
(iii) noun governors was automatically retrieved from
machine-processed text. The similarity coefficient for
each pair was determined as the ratio of the total of
such shared words to the product of the frequencies of the
two nouns in the text. The 780 pairs were ranked according
to this coefficient. The text comprised 120,000 running
words of physics text processed at The RAND Corporation;
the frequencies of occurrence of the forty nouns in this
text ranged from 42 to 328.
The results suggest that the sample of text is of
sufficient size to be useful for the intended purpose. Many
noun pairs with similar properties (synonymy, antonymy,
derivation from distributionally similar verbs, etc.) are
characterized by high similarity coefficients; the converse
is not observed. The relevance of various syntactic
relationships as criteria for measurement is discussed.
1. INTRODUCTION
One of the goals of studies in Distributional Semantics 
is the establishment of word classes on the basis of the 
observed behavior of words in written texts. A convenient 
and significant way of discussing "behavior" of words is 
in terms of syntactic relationship. At the outset, in 
fact, it is necessary that we treat a word in terms of its 
Syntactically Related Words (SRW). In a given text, each 
word bears a given syntactic relationship to a finite num- 
ber of other words; e.g., a finite number of words (nouns 
and pronouns) appear as "subject" for each active verb; 
another group of nouns and pronouns are used as "direct 
object" of each transitive verb; other words of the class, 
"adverb," appear as modifiers of a given verb. In each 
instance we may speak of the related words as SRW of a given 
verb, so that in our example three different types of SRW
emerge; a given SRW is then defined in terms both of word 
class and specific relationship to the verb. (A given noun 
may of course belong to two different types of SRW, e.g., 
as both subject and object of the same verb.) 
Distributionally, we may compare two verbs in terms
of their SRW. The objective of the present study is to
test the premise that "similar" words tend to have the same
SRW. This premise is tested, not with verbs, as in the
above example, but with nouns. Our procedure is (1) to find
in a given text three types of SRW for a small group of
nouns, (2) to find the number of SRW shared by each pair
of nouns formed from the group, and (3) to express the
"similarity" between individual nouns, and groups of nouns,
as a function of their shared SRW. Another example: it
might turn out that in a given text the nouns "a" and "b"
("avocado" and "cherry") share such adjective modifiers as
"ripe," whereas nouns "c" and "d" ("chair" and "furniture")
have in common the adjective modifier "modern." These
facts would lead us to conclude that "a" and "b" are
similar, that "c" and "d" are similar, that "a" and "c" are
less similar, etc.
A number of questions arise: What is "similarity" 
anyway? Do words that are similar in meaning really share 
a significant number of SRW in a given text? What is "a 
significant number"? Do not dissimilar words also have many 
common SRW? How much text is necessary in order to
establish patterns of word behavior? What is the effect of
multiple meaning in words, and of using texts from
different subject areas? The present investigation should be
regarded as an experiment designed to throw some light on 
these questions; no validity is claimed for the "results" 
obtained. Our audacity in attempting the experiment at all 
is based on three factors: the possession of a text in a 
limited field (physics), the foreknowledge that the
multiple-meaning problem is minimal, and the capability for
automatic processing of text. (The latter is clearly a
necessity, in view of the size and complexity of the problem.) The
reader may well conclude that the experiment proves nothing. 
We would hope, however, that such an opinion would not 
preclude a critical judgment of the procedures employed, 
or the suspension of disbelief if the results do not 
correspond with his expectations. 
2. PROCEDURE
The present study was based on a series of articles
from Russian physics journals, comprising approximately
120,000 running words (some 500 pages). The processing of
this text has been described elsewhere.(1,2) Here, we
note only that each sentence of this text is recorded
on magnetic tape, together with the following information
for each occurrence in the sentence: its part of speech,
its "word number" (an identification number in the machine
glossary), and its syntactic "governor" or "dependent"
(if any) in the sentence. A retrieval program applied to
this text tape then yielded information about the SRW for
words in which we were interested. For convenience and
economy, all words in the machine printout for this study
are identified by word number, rather than in their
"natural-language" form.
In our study we chose to deal with the SRW of forty
Russian nouns, herein called Test Words (TW). The number
is completely arbitrary; the particular nouns chosen (see
Table 1) were presumed to form different semantic groupings.
Table 1 gives one possible grouping of these words; the
criteria for grouping are more or less obvious, although
the reader may easily form different groups, by expanding
or contracting the groups that we have designated. The
only purpose of grouping is to provide a weak measure of
control in the experiment: if two nouns are found to be
similar in terms of their SRW, we should like to compare
this finding with some intuitive understanding of their
similarity. (For convenience, we shall refer to the TWs
by their English equivalents.)
Two nouns may be compared with reference to several
different types of SRW. Here, we have chosen to limit our
comparison to three types: the adjective dependents (in
either attributive or predicative function), the noun
dependents (normally, but not necessarily, in the genitive
case in Russian), and the noun governors (the TW is
normally, but not necessarily, in the genitive case). Strictly
speaking, the syntactic function of the SRW should be taken
into account. In ignoring this factor, we are consciously
permitting certain inexactitudes, on the premise that the
distortions introduced into measurement will not be severe.
The task of manually retrieving SRW for each occurrence
of the 40 TWs, and of comparing each TW with every other
TW, is too tedious to be attempted. The aid of the computer
was enlisted, in two ways:

Table 1
39 TEST NOUNS

Group  English          Russian          W No.    F    L1   L2   L3    L4
  1    calculation 1    vyčislenie         782    62   15   23   11    49
  1    measurement      izmerenie         1579   328   29   63   36   128
  1    determination    opredelenie       3324   121    7   39   14    60
  1    calculation 2    rasčet            4627    90   12   24   16    53
  2    consideration    rassmotrenie      4598    51   14   29    6    49
  2    comparison       sravnenie         5200   106    6   22    4    32
  2    study            izučenie          1610    64    8   44    6    58
  2    investigation    issledovanie      1723   159   32   65   21   118
  3    relation         sootnošenie       5111   113   14   18   15    47
  3    ratio            otnošenie         3455   102   14   22    9    45
  3    correspondence   sootvetstvie      5109    29    2    1    0     3
  4    solution         rastvor           4608   129    6   22   24    52
  4    compound         soedinenie        5082    15    5    5    6    16
  4    alloy            splav             5182    27    6    2    4    12
  5    metal            metall            2400    86   11    2   28    41
  5    gas              gaz                807    37    7    2    8    17
  5    liquid           židkost'          1329    56    8    2   15    25
  5    crystal          kristall          2131   171   15   19   44    78
  6    uranium          uran              5745   171    0    0   18    18
  6    silver           serebro           4899    48    4    1   17    22
  6    copper           med'              2419    58    2    3   20    25
  6    phosphor         fosfor            5913   130    9    2   34    45
  7    proton           proton            4565   125    8    2   27    37
  7    ion              ion               1686    98   14   10   31    55
  7    molecule         molekula          2568   112   18   18   39    75
  7    atom             atom               186   106    9   23   28    60
  8    formula          formula           5911   231   20   21   19    60
  8    expression       vyraženie          739   223   25   12   24    61
  8    equation         uravnenie         5742   412   42   24   32    98
  9    width            širina            6198    43    4    9    9    22
  9    depth            glubina            913    40    6    8    9    23
  9    length           dlina             1194   112   16   21   22    59
  9    height           vysota             764    23    2   11    3    16
 10    presence         naličie           2696   119    3   73    5    81
 10    absence          otsutstvie        3485    44    2   35    1    38
 10    existence        suščestvovanie    5352    41    3   25    6    34
 11    question         vopros             615    96    5    3   10    18
 11    problem 1        zadača            1362    68   15   11   10    36
 11    problem 2        problema          4254    26    4   10    6    20

"W No." = word number; "F" = frequency

1. Through automatic scanning of the text, each
occurrence of the 40 TWs was located, and in each instance
the identity (word number) of relevant SRW was recorded.
A listing was produced for each of the TWs (see Table 2,
"SRW Detail," for an example of the TW, VYČISLENIE =
calculation 1), showing the different words used as adjective
dependents (List 1), noun dependents (List 2), and noun
governors (List 3). The number of words on each of these
lists is also shown in Table 1, together with the total
number of SRW for each TW (List 4). We stress the fact that
these numbers refer to different words used as SRW; the
repetition of a given SRW (for a given SRW type) was not
recorded.
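
The retrieval step just described can be sketched as follows. This is a minimal illustration, not the retrieval program actually used: the record layout (word number, part of speech, index of governor), the part-of-speech codes "A" and "N", and the function name collect_srw are all assumptions made for the example.

```python
# Hypothetical token record: (word_number, part_of_speech, governor_index).
# governor_index is the position of the token's governor in the same
# sentence, or None if the token has no governor.

def collect_srw(sentences, test_words):
    """Return {tw: {"L1": set, "L2": set, "L3": set}} of distinct SRW."""
    srw = {tw: {"L1": set(), "L2": set(), "L3": set()} for tw in test_words}
    for sentence in sentences:
        for word, pos, gov in sentence:
            if gov is None:
                continue
            gov_word, gov_pos, _ = sentence[gov]
            # The governor is a test noun: record adjective (List 1)
            # or noun (List 2) dependents.
            if gov_word in srw and gov_pos == "N":
                if pos == "A":
                    srw[gov_word]["L1"].add(word)
                elif pos == "N":
                    srw[gov_word]["L2"].add(word)
            # The dependent is a test noun governed by a noun:
            # record the noun governor (List 3).
            if word in srw and pos == "N" and gov_pos == "N":
                srw[word]["L3"].add(gov_word)
    return srw
```

Because each list is a set, a repeated SRW is counted once, in keeping with the statement that repetitions were not recorded.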

Table 2
SRW DETAIL: VYČISLENIE (CALCULATION 1)
[machine listing not legible in source]

Table 3
SIMILARITY RANKING BY TW: VYČISLENIE (CALCULATION 1)
[machine listing not legible in source]

2. Each TW was automatically compared with every other
TW, with respect to their shared SRW, i.e., in terms of
the words in Lists 1, 2, and 3 of the "SRW Detail Listing."
A new listing, "Similarity Ranking by TW," was then produced
(see Table 3 for the TW, VYČISLENIE = calculation 1). This
listing shows for each TW the number of shared SRW of each
of the three types (N1, N2, and N3, Table 3), the total
number of shared SRW (NA), and a measure of similarity for
the pairs, herein designated as the Similarity Coefficient
(SC). The SC is a decimal fraction obtained by dividing
the sum of shared SRW for each pair of TWs by the product
of the frequencies of the two TWs. (The latter is of course
a device for taking into account the differing frequencies
of the TWs; other means for determining this coefficient
can be utilized.) The pairings for each TW are ordered on
the value of the SC. It should be noted that the similarity
between TWs is measured in terms of the total number of
shared SRW (Column NA of Table 3); it is also possible to
express this measurement in terms of shared SRW of any
single type.
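
The Similarity Coefficient defined above, SC = NA / (F_a x F_b), can be sketched in a few lines. The function names and the dictionary layout for the three SRW lists are assumptions made for illustration; the figures in the example (NA = 18, frequencies 62 and 90) are those printed for calculation 1 and calculation 2.

```python
def shared_counts(srw_a, srw_b):
    """Count shared SRW of each type (N1, N2, N3) and their total (NA)."""
    n1 = len(srw_a["L1"] & srw_b["L1"])  # shared adjective dependents
    n2 = len(srw_a["L2"] & srw_b["L2"])  # shared noun dependents
    n3 = len(srw_a["L3"] & srw_b["L3"])  # shared noun governors
    return n1, n2, n3, n1 + n2 + n3


def similarity_coefficient(na, freq_a, freq_b):
    """Total shared SRW divided by the product of the two frequencies."""
    return na / (freq_a * freq_b)


# calculation 1 (F = 62) and calculation 2 (F = 90) share 18 SRW.
print(round(similarity_coefficient(18, 62, 90), 5))  # 0.00323
```

The result agrees with the SC of 323 (i.e., .00323) listed for this pair among the high-ranking pairs.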
A third listing was also produced: a listing of the
780 TW-pairs, ordered on the value of the SC. This listing,
not reproduced here because of its length, will be referred
to as "Ranking of TW-Pairs by SC." Table 4 shows the
distribution of the SC as compared with the number of TW pairs.
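
A minimal sketch of how such a ranking of all pairs might be produced, assuming the SRW lists are held as sets and the frequencies in a dictionary (both layouts, and the name rank_pairs, are hypothetical). Forty test words yield 40 x 39 / 2 = 780 unordered pairs.

```python
from itertools import combinations


def rank_pairs(srw, freq):
    """Return (sc, na, tw_a, tw_b) for every unordered pair, highest SC first."""
    rows = []
    for a, b in combinations(sorted(srw), 2):
        # NA: shared SRW summed over the three types.
        na = sum(len(srw[a][k] & srw[b][k]) for k in ("L1", "L2", "L3"))
        rows.append((na / (freq[a] * freq[b]), na, a, b))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows
```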

Table 4
DISTRIBUTION OF SC BY NUMBER OF TW-PAIRS
[graph not legible in source]

The following discussion is based on the three listings
described above. A few additional remarks may be
made about the procedure itself, which may be likened to
deep-sea fishing with a tea strainer full of holes. The
limitations of size are obvious: we have limited ourselves
to three of the numerous ways of comparing nouns in terms
of their SRW. Other types of SRW that suggest themselves
are: verbs, where TW is subject; verbs, where TW is direct
object; prepositional phrases as dependents, or governors,
of TW; nouns joined to TW through coordinate conjunctions
(i.e., "apples" and "grapes" are said to be more similar if
"apples and oranges" and "grapes and oranges" occur in
text). Some of the holes in our tea strainer are: the
neglect of the case of the noun dependent of TW, or the
case of the TW when the SRW is a noun governor; the neglect
of technical symbols in physical text, as dependent or
governor of the TW; the failure to distinguish between
different functions of governors or dependents in a
noun/noun pair (e.g., the distinction between "subjective" and
"objective" genitive); the neglect of transformationally
equivalent constructions. In view of these deficiencies (not
to mention the problem of statistics), the success of our
fishing expedition is open to doubt. Let us then proceed
to examine the catch.
3. RESULTS
The evaluation of the data contained in our three
machine listings is not an easy task. We can scarcely
examine and discuss the degrees of similarity of 780
noun-pairs. The problem of interpretation is also complicated:
how completely and accurately should the results
correspond with our expectations, as represented in the tentative
semantic groupings (Table 1)? Our approach is to deal in
a summary manner with the noun-pairs characterized by
highest Similarity Coefficients, especially with respect to
their intra- and inter-group relationships. Before proceeding
to this discussion, a few preliminary remarks should be
made about the data in the various machine listings.
The summary of SRW counts for each TW, contained in
Table 1, suggests that not all TWs have the same opportunity
for comparison. In the case of "correspondence" (Group 3),
a total of only three SRW is noted (Column L4); as a
result, this TW should be eliminated from further
consideration. In addition, unless at least two, and preferably
all three, types of SRW are well represented for a given
TW, the SC for that noun will tend to be skewed. As
examples, we note all nouns in Group 6 (for which the L3
column predominates), and the nouns in Group 10 (for which
the L2 column predominates). In effect, these nouns are
"deficient" in certain types of SRW, and require special
handling.*
* An abstract of a paper on the proclivity of nouns to
enter into certain combinations is cited in Reference 3.
On the printout, "Ranking of TW-Pairs by SC," a
number of noun pairs appear at the top end of the scale
although the total number of shared SRW is small (i.e., the
value of column "NA" (see Table 3) is "1," "2," or "3").
The SC may be high, because the product of the frequencies
is relatively low. Our policy has been to discount these
pairs on the grounds that the value of "NA" is significant
in determining the similarity between two TWs. The minimum
value for NA was arbitrarily set at four.
Keeping these emendations to the data in mind,
we proceed to the discussion of the noun-pairs characterized
by highest SC. Table 4 shows the distribution of
SC by noun-pairs. By any standard, the data shows negative
or extremely weak similarity for most of the 780 pairs.
At which point on the curve shall we draw a line, saying
that an SC above this value indicates similarity, and
that an SC below this value indicates dissimilarity or
weak similarity (all this, of course, in terms of reliability)?
For purposes of discussion, we propose to set the threshold
at .00100, a rigorously high figure. After eliminating
pairs whose NA value is less than 4, we find 38 pairs whose
SC lies in the range .00100 to .00337 (Table 5). (In the
table, the first two zeroes are dropped.)
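
The two cutoffs just described (NA of at least four, SC of at least .00100) amount to a simple filter. The sketch below assumes each pair is held as a tuple (tw_i, tw_j, na, sc); that layout and the names MIN_NA, SC_THRESHOLD, and high_ranking are illustrative only. The first two sample pairs carry values printed in Table 5; the last two are invented to show the cutoffs at work.

```python
MIN_NA = 4              # minimum number of shared SRW
SC_THRESHOLD = 0.00100  # minimum Similarity Coefficient


def high_ranking(pairs):
    """Keep pairs passing both cutoffs, ordered by SC, highest first."""
    kept = [p for p in pairs if p[2] >= MIN_NA and p[3] >= SC_THRESHOLD]
    return sorted(kept, key=lambda p: p[3], reverse=True)


pairs = [
    ("question", "problem 2", 6, 0.00240),
    ("calculation 1", "calculation 2", 18, 0.00323),
    ("x", "y", 3, 0.00500),   # hypothetical: high SC, but NA below 4
    ("u", "v", 10, 0.00080),  # hypothetical: NA adequate, SC too low
]
for tw_i, tw_j, na, sc in high_ranking(pairs):
    print(tw_i, tw_j, na, sc)
```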
The reader may draw his own conclusions about the
degree of similarity between the nouns in any given
pairing. For purposes of discussion, we will refer to the
pairings in terms of our preliminary groupings (Table 1).
The following intra- and inter-Group pairings are observed
in Table 5:

  Nouns of Group    pair with nouns of Group
        1           1, 2
        2           1, 2, 10
        3           (none)
        4           5
        5           4, 5, 6, 7
        6           5, 6
        7           5, 7
        8           (none)
        9           9
       10           2, 10
       11           5, 11

We note that no pairings appear for nouns of Groups 3
and 8. All other groups except Group 4 are represented by
intra-group pairings; to this degree, our expectations
are fulfilled, i.e., the data supports our a priori
feelings for the similarity between words. The amount of
inter-group pairing may indicate either that the data is
inconclusive, or that our original groupings were too narrow.
In fact, two larger groups emerge: one composed of Groups
1 and 2 (perhaps including Group 10), the other composed
of Groups 4, 5, 6, and 7. This tendency is more marked
if we lower the SC threshold from .00100 to .00070,
thereby adding a total of 28 pairs to the number listed in
Table 5. For example, nouns of Group 1 are found to pair
with those of Group 10, and nouns of Group 4 pair with
those of Groups 6 and 7.

Table 5
"HIGH RANKING TW-PAIRS"

TWI             Group   TWJ             Group    SC    NA
calculation 1     1     calculation 2     1     323    18
                        consideration     2     285     9
                        determination     1     200    15
                        investigation     2     183    18
                        measurement       1     113    23
                        study             2     101     4
determination     1     calculation 2     1     165    18
study             2     consideration     2     337    11
                        existence        10     267     7
                        investigation     2     246    25
                        absence          10     213
                        calculation 2     1     139     8
                        presence         10     118     9
                        determination     1     116     9
investigation     2     consideration     2     173    14
                        calculation 2     1     154    22
                        absence          10     114     8
                        existence        10     107     7
consideration     2     calculation 2     1     174     8
liquid            5     molecule          7     143     9
                        problem 1        11     105     4
                        metal             5     125     6
                        crystal           5     104    10
gas               5     metal             5     126     4
metal             5     silver            6     194     8
crystal           5     compound          4     156     4
copper            6     silver            6     180
                        metal             5     120     6
ion               7     copper            6     106
atom              7     ion               7     125    13
height            9     length            9     155     4
depth             9     width             9     233     4
length            9     width             9     125
absence          10     existence        10     222
                        calculation 2     1     101
presence         10     absence          10     229    12
                        existence        10     225    11
question         11     problem 2        11     240     6

[Blank NA entries are not legible in the source.]

The data is not statistically conclusive, but strongly
suggests the emergence of the two major groups mentioned
above. The amalgamation of Groups 1 and 2 can easily be
defended on semantic grounds; since Group 10, as noted
above, is subject to aberrant behavior (because of the very
high number of noun dependents), its inter-relation with
Groups 1 and 2 may not be taken seriously.* Groups 4, 5,
6, and 7, which include the names of chemical mixtures,
classes of elements, individual elements, and components of
elements, may be taken together semantically as a single
sub-class of "object nouns." The physicist tends to say the
same things about all nouns in this group.
* It should also be noted that the noun dependents of
Group 10 nouns serve a "subjective" rather than "objective"
function. If we had distinguished between the syntactic
functions of the noun dependent, TWs of Group 10 would be only
weakly similar to TWs of Groups 1 and 2.
One of the 38 pairs listed in Table 5 appears to
contradict expectation: "liquid"/"problem" (Groups 5 and 11).
The four SRW shared by these two nouns include the adjective
"certain" and the noun governor "number." The
non-discriminatory ("promiscuous") nature of these two SRW is perhaps
obvious, and one of the refinements that should be
introduced in future studies is the neglect of such words as
"significant" SRW. (The study of "promiscuity" in
adjectives is referred to in Reference 4.) At the present,
experience suggests that distortions introduced by such
words are minimal if the number of SRW is sufficiently large.
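
One way the proposed refinement might be approached is sketched below: an SRW is treated as "promiscuous" if it attaches to more than a given share of all TWs, and such words can then be excluded before shared SRW are counted. This is an assumption for illustration only; the experiment reported here applied no such filter, and the cutoff max_share and the function name are hypothetical.

```python
from collections import Counter


def promiscuous_srw(srw, kind, max_share=0.5):
    """Return SRW of one type attached to more than max_share of all TWs."""
    # Each TW's list is a set, so a word counts once per TW.
    counts = Counter(w for lists in srw.values() for w in lists[kind])
    limit = max_share * len(srw)
    return {w for w, n in counts.items() if n > limit}
```

With such a set in hand, the shared-SRW count for a pair would simply omit its members.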
Our general conclusion is that, with a few anomalies,
the 66 pairings for which the SC is .00070 or higher
meet with our expectations.
Another aspect of the question remains: many nouns
with presumed similarity are not represented on the high
end of the SC distribution curve. (If we lower the thresh- 
old to include such pairs we shall also encounter many 
non-similar pairs.) One way of dealing with this problem 
is to consider the most highly correlated pairs that nouns 
in each Group form, whether or not the SC is "signifi- 
cantly" high. In lieu of presenting this information in 
full detail, we show in Table 6 the most closely correlated 
pairs for a representative noun from each of the Groups 
(excepting Groups 3, 4, and 8). 

Table 6
MOST CLOSELY CORRELATED PAIRS FOR REPRESENTATIVE NOUNS
[table not legible in source]

The most striking aspect of Table 6 is the repetition
of intra- and inter-Group pairings noted in Table 5 for
high-SC pairings. In other words, the relative value of
the SC appears to be as significant as the absolute value.
This result was certainly not expected, and perhaps
indicates a greater sensitivity in our measurement procedures
than we would have thought reasonable.
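
Selecting the most closely correlated partners of each noun, whatever the absolute value of the SC, can be sketched as follows. The tuple layout (tw_i, tw_j, sc), the parameter k, and the function name are assumptions made for the example; the three sample pairs carry SC values printed in Table 5.

```python
from collections import defaultdict


def closest_partners(pairs, k=3):
    """Return {tw: [(partner, sc), ...]} with each TW's k highest-SC partners."""
    by_tw = defaultdict(list)
    for a, b, sc in pairs:
        # A pair is unordered, so it is filed under both of its members.
        by_tw[a].append((b, sc))
        by_tw[b].append((a, sc))
    return {
        tw: sorted(partners, key=lambda p: p[1], reverse=True)[:k]
        for tw, partners in by_tw.items()
    }


pairs = [
    ("height", "length", 0.00155),
    ("depth", "width", 0.00233),
    ("length", "width", 0.00125),
]
print(closest_partners(pairs, k=1)["length"])  # [('height', 0.00155)]
```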
Table 6 suggests, but does not prove, the existence
of clusters (or "clumps") of TWs, in which the members are
closely correlated with each other, and in which no member
is closely correlated to any outside word. We have not
yet attempted to apply clumping procedures; a better
understanding of the data is perhaps a prerequisite to this
rigorous treatment. For the present, we shall point out
a phenomenon that strongly suggests the existence of
clumps: the recurrence of the same SRW among several TWs
with high mutual correlation. Consider, for example, that
a high SC is found between Test Words A and B, B and C,
and A and C; if, in addition, a relatively high proportion
of SRW are shared by all three TWs, the mutual connection
of the three words would appear to be considerably
strengthened. The recurrence of SRW has not been systematically
studied, but the following sample is offered as an
illustration of the phenomenon. Below, we list all the SRW
of the three types, for the TW calculation 1. The
underlined words are those which, in addition, also served as
corresponding SRW for two other TWs (determination and
measurement) that are highly correlated to each other and
to calculation 1.

Table 7
SRW OF CALCULATION 1

Adjective Dependents (L1):  TAKOJ (such); ANALOGIČNYJ (analogous);
                            DAL'NEJŠIJ (further); NAŠ (our);
                            NEPOSREDSTVENNYJ (direct).
Noun Dependents (L2):       ZAVISIMOST' (dependence); MASSA (mass);
                            VELIČINA (magnitude); SEČENIE (cross-section);
                            KOEFFICIENT (coefficient); MODUL' (modulus);
                            RASSTOJANIE (distance); SILA (force);
                            FORMA (form).
Noun Governors (L3):        ZRENIE (view); REZUL'TAT (result);
                            VOZMOŽNOST' (possibility);
                            [illegible] (method).

[Underlining is not preserved in the source.]

Table 7 shows that eighteen SRW appeared for calculation 1.
Of these, one half (nine) also appeared as SRW for both
determination and measurement. It would seem that the
"togetherness" of these three TWs is strengthened by this
feature, which we term "recurrence of SRW." We have no
ready formula for determining that recurrence is or is not
significant in a given situation. In general, the nature and
behavior of individual SRW remain to be studied, so far as
their relevance to our problem is concerned.
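
The proportion reported above (nine of eighteen, or one half) can be computed as sketched here. No formula for the significance of recurrence is proposed in this paper; the function below only measures the fraction of one TW's SRW that recur, in the same syntactic type, for two other TWs. The name recurrence and the set-based layout are assumptions.

```python
def recurrence(srw_a, srw_b, srw_c):
    """Fraction of A's SRW that also serve, in the same type, for both B and C."""
    total = 0
    recurring = 0
    for kind in ("L1", "L2", "L3"):
        total += len(srw_a[kind])
        recurring += len(srw_a[kind] & srw_b[kind] & srw_c[kind])
    return recurring / total if total else 0.0
```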
4. CONCLUSIONS
We conclude that there is considerable agreement
between the results of our experiment and an a priori
feeling for the similarity of words. Words that are similar
in meaning tend to have the same SRW to a far greater
degree than chance would determine. If this conclusion is
valid, a large-scale experiment is suggested, using a
larger number of Test Words, more SRW types, and a larger
amount of text. (The text base for the present experiment
proved to be adequate; larger amounts of text should,
however, remove some of the anomalies.) The question of
further refinements in the procedure must also be taken
seriously: e.g., we may also take into account multiple
occurrences of an SRW, distinguish to some degree the
different functions of noun governors or noun dependents, and
discount the occurrence of "promiscuous" SRW. Clumping
procedures should be applied, perhaps taking into account
the recurrence of individual SRW among a group of Test
Words.

REFERENCES 

1. Hays, D. G., and T. W. Ziehe, Russian Sentence-Structure
   Determination, The RAND Corporation, RM-2538, April 1960.

2. Hays, D. G., Basic Principles and Technical Variations
   in Sentence-Structure Determination, The RAND Corporation,
   P-1981, April 1960.

3. Harper, K. E., "A Study of the Combinatorial Properties
   of Russian Nouns," Mechanical Translation, August 1963,
   p. 36.

4. Harper, K. E., Procedures for the Determination of
   Distributional Classes, The RAND Corporation, RM-2713,
   January 1961.
