<?xml version="1.0" standalone="yes"?>
<Paper uid="C65-1010">
  <Title>MEASUREMENT OF SIMILARITY BETWEEN NOUNS</Title>
  <Section position="1" start_page="10" end_page="10" type="metho">
    <SectionTitle>
1965 International Conference on
Computational Linguistics
MEASUREMENT OF SIMILARITY BETWEEN NOUNS
</SectionTitle>
    <Paragraph position="0"> A study was made of the degree of similarity between pairs of Russian nouns, as expressed by their tendency to occur in sentences with identical words in identical syntactic relationships. A similarity matrix was prepared for forty nouns; for each pair of nouns the number of shared (i) adjective dependents, (ii) noun dependents, and (iii) noun governors was automatically retrieved from machine-processed text. The similarity coefficient for each pair was determined as the ratio of the total of such shared words to the product of the frequencies of the two nouns in the text. The 780 pairs were ranked according to this coefficient. The text comprised 120,000 running words of physics text processed at The RAND Corporation; the frequencies of occurrence of the forty nouns in this text ranged from 42 to 328.</Paragraph>
    <Paragraph position="1"> The results suggest that the sample of text is of sufficient size to be useful for the intended purpose. Many noun pairs with similar properties (synonymy, antonymy, derivation from distributionally similar verbs, etc.) are characterized by high similarity coefficients; the converse is not observed. The relevance of various syntactic relationships as criteria for measurement is discussed.</Paragraph>
    <Paragraph position="2"> Harper 1</Paragraph>
  </Section>
  <Section position="2" start_page="10" end_page="10" type="metho">
    <SectionTitle>
MEASUREMENT OF SIMILARITY BETWEEN NOUNS
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> One of the goals of studies in Distributional Semantics is the establishment of word classes on the basis of the observed behavior of words in written texts. A convenient and significant way of discussing &amp;quot;behavior&amp;quot; of words is in terms of syntactic relationship. At the outset, in fact, it is necessary that we treat a word in terms of its Syntactically Related Words (SRW). In a given text, each word bears a given syntactic relationship to a finite number of other words; e.g., a finite number of words (nouns and pronouns) appear as &amp;quot;subject&amp;quot; for each active verb; another group of nouns and pronouns are used as &amp;quot;direct object&amp;quot; of each transitive verb; other words of the class, &amp;quot;adverb,&amp;quot; appear as modifiers of a given verb. In each instance we may speak of the related words as SRW of a given verb, so that in our example three different types of SRW emerge; a given SRW is then defined in terms both of word class and specific relationship to the verb. (A given noun may of course belong to two different types of SRW, e.g., as both subject and object of the same verb.) Distributionally, we may compare two verbs in terms of their SRW. The objective of the present study is to test the premise that &amp;quot;similar&amp;quot; words tend to have the same SRW. This premise is tested, not with verbs, as in the above example, but with nouns. Our procedure is (1) to find in a given text three types of SRW for a small group of nouns, (2) to find the number of SRW shared by each pair of nouns formed from the group, and (3) to express the &amp;quot;similarity&amp;quot; between individual nouns, and groups of nouns, as a function of their shared SRW.
Another example: it might turn out that in a given text the nouns &amp;quot;a&amp;quot; and &amp;quot;b&amp;quot; (&amp;quot;avocado&amp;quot; and &amp;quot;cherry&amp;quot;) share such adjective modifiers as &amp;quot;ripe,&amp;quot; whereas nouns &amp;quot;c&amp;quot; and &amp;quot;d&amp;quot; (&amp;quot;chair&amp;quot; and &amp;quot;furniture&amp;quot;) have in common the adjective modifier &amp;quot;modern.&amp;quot; These facts would lead us to conclude that &amp;quot;a&amp;quot; and &amp;quot;b&amp;quot; are similar, that &amp;quot;c&amp;quot; and &amp;quot;d&amp;quot; are similar, that &amp;quot;a&amp;quot; and &amp;quot;c&amp;quot; are less similar, etc.</Paragraph>
    <Paragraph position="1"> A number of questions arise: What is &amp;quot;similarity&amp;quot; anyway? Do words that are similar in meaning really share a significant number of SRW in a given text? What is &amp;quot;a significant number&amp;quot;? Do not dissimilar words also have many common SRW? How much text is necessary in order to establish patterns of word behavior? What is the effect of multiple meaning in words, and of using texts from different subject areas? The present investigation should be regarded as an experiment designed to throw some light on these questions; no validity is claimed for the &amp;quot;results&amp;quot; obtained. Our audacity in attempting the experiment at all is based on three factors: the possession of a text in a limited field (physics), the foreknowledge that the multiple-meaning problem is minimal, and the capability for automatic processing of text. (The latter is clearly a necessity, in view of the size and complexity of the problem.) The reader may well conclude that the experiment proves nothing.</Paragraph>
    <Paragraph position="2"> We would hope, however, that such an opinion would not preclude a critical judgment of the procedures employed, or the suspension of disbelief if the results do not correspond with his expectations.</Paragraph>
  </Section>
  <Section position="3" start_page="10" end_page="22" type="metho">
    <SectionTitle>
2. PROCEDURE
</SectionTitle>
    <Paragraph position="0"> The present study was based on a series of articles from Russian physics journals, comprising approximately 120,000 running words (some 500 pages). The processing of this text has been described elsewhere (1,2). Here, we note only that each sentence of this text is recorded on magnetic tape, together with the following information for each occurrence in the sentence: its part of speech, its &amp;quot;word number&amp;quot; (an identification number in the machine glossary), and its syntactic &amp;quot;governor&amp;quot; or &amp;quot;dependent&amp;quot; (if any) in the sentence. A retrieval program applied to this text tape then yielded information about the SRW for words in which we were interested. (For convenience and economy, all words in the machine printout for this study are identified by word number, rather than in their &amp;quot;natural-language&amp;quot; form.)</Paragraph>
    <Paragraph position="1"> In our study we chose to deal with the SRW of forty Russian nouns, herein called Test Words (TW). The number is completely arbitrary; the particular nouns chosen (see Table 1) were presumed to form different semantic groupings. Table 1 gives one possible grouping of these words; the criteria for grouping are more or less obvious, although the reader may easily form different groups, by expanding or contracting the groups that we have designated. The only purpose of grouping is to provide a weak measure of control in the experiment: if two nouns are found to be similar in terms of their SRW, we should like to compare this finding with some intuitive understanding of their similarity. (For convenience, we shall refer to the TWs by their English equivalents.) Two nouns may be compared with reference to several different types of SRW. Here, we have chosen to limit our comparison to three types: the adjective dependents (in either attributive or predicative function), the noun dependents (normally, but not necessarily, in the genitive case in Russian), and the noun governors (the TW is normally, but not necessarily, in the genitive case). Strictly speaking, the syntactic function of the SRW should be taken into account. In ignoring this factor, we are consciously permitting certain inexactitudes, on the premise that the distortions introduced into measurement will not be severe.</Paragraph>
    <Paragraph position="2"> The task of manually retrieving SRW for each occurrence of the 40 TWs, and of comparing each TW with every other TW, is too tedious to be attempted. The aid of the computer was enlisted in two ways.</Paragraph>
    <Paragraph position="4"> (Table 1 legend: &amp;quot;No.&amp;quot; = word number; &amp;quot;F&amp;quot; = frequency.) 1. Through automatic scanning of the text, each occurrence of the 40 TWs was located, and in each instance the identity (word number) of relevant SRW was recorded. A listing is produced for each of the TWs (see Table 2, &amp;quot;SRW Detail,&amp;quot; for an example of the TW, VYCISLENIE = calculation 1), showing the different words used as adjective dependents (List 1), noun dependents (List 2), and noun governors (List 3). The number of words on each of these lists is also shown in Table 1, together with the total number of SRW for each TW (List 4). We stress the fact that these numbers refer to different words used as SRW; the repetition of a given SRW (for a given SRW type) was not recorded.</Paragraph>
    <Paragraph position="5"> 2. Each TW was automatically compared with every other TW, with respect to their shared SRW, i.e., in terms of the words in Lists 1, 2, and 3 of the &amp;quot;SRW Detail Listing.&amp;quot; A new listing, &amp;quot;Similarity Ranking by TW,&amp;quot; is then produced (see Table 3 for the TW, VYCISLENIE = calculation 1). This listing shows for each TW the number of shared SRW of each of the three types (N1, N2, and N3, Table 3), the total number of shared SRW (NA), and a measure of similarity for the pairs, herein designated as the Similarity Coefficient (SC). The SC is a decimal fraction obtained by dividing the sum of shared SRW for each pair of TWs by the product of the frequencies of the two TWs. (The latter is of course a device for taking into account the differing frequencies</Paragraph>
    <Paragraph position="7"> of the TWs; other means for determining this coefficient can be utilized.) The pairings for each TW are ordered on the value of the SC. It should be noted that the similarity between TWs is measured in terms of the total number of shared SRW (Column NA of Table 3); it is also possible to express this measurement in terms of shared SRW of any single type.</Paragraph>
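    <Paragraph> The two machine steps described above can be sketched in Python. This is only an illustration under stated assumptions, not the original RAND retrieval program: the record layout, the function names, the word numbers, and the frequencies are all invented for the example. The sketch collects the distinct SRW of each type for every Test Word, counts the SRW shared by a pair (N1, N2, N3, and their total NA), and divides NA by the product of the two frequencies to obtain the SC.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical record layout: (test_word, srw_type, srw_word_number).
# srw_type 1 = adjective dependent, 2 = noun dependent, 3 = noun governor.
RECORDS = [
    ("calculation", 1, 101), ("calculation", 1, 101),  # a repeated SRW counts once
    ("calculation", 2, 205), ("calculation", 3, 310),
    ("measurement", 1, 101), ("measurement", 2, 205),
    ("measurement", 3, 311),
    ("liquid", 1, 150), ("liquid", 3, 310),
]
# Hypothetical text frequencies of the Test Words.
FREQ = {"calculation": 50, "measurement": 80, "liquid": 100}


def build_srw_lists(records):
    """Return {test_word: {srw_type: set of distinct SRW word numbers}}."""
    lists = defaultdict(lambda: {1: set(), 2: set(), 3: set()})
    for tw, srw_type, word_no in records:
        lists[tw][srw_type].add(word_no)
    return lists


def similarity(tw_a, tw_b, lists, freq):
    """Return (N1, N2, N3, NA, SC) for one pair of Test Words."""
    shared = [len(lists[tw_a][t].intersection(lists[tw_b][t])) for t in (1, 2, 3)]
    na = sum(shared)
    sc = na / (freq[tw_a] * freq[tw_b])
    return (*shared, na, sc)


lists = build_srw_lists(RECORDS)
# Rank every pair of Test Words by descending SC.
ranking = sorted(
    ((a, b) + similarity(a, b, lists, FREQ) for a, b in combinations(sorted(FREQ), 2)),
    key=lambda row: row[-1],
    reverse=True,
)
for row in ranking:
    print(row)
```

    With forty Test Words the same loop over combinations would visit 780 pairs. Dividing by the product of the frequencies is the normalization named in the text; as the text notes, other normalizations could be substituted in the similarity function.</Paragraph>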
    <Paragraph position="8"> A third listing was also produced: a listing of the 780 TW-pairs, ordered on the value of the SC. This listing, not reproduced here because of its length, will be referred to as &amp;quot;Ranking of TW-Pairs by SC.&amp;quot; Table 4 shows the distribution of the SC as compared with the number of TW pairs. The following discussion is based on the three listings described above. A few additional remarks may be made about the procedure itself, which may be likened to deep-sea fishing with a tea strainer full of holes. The limitations of size are obvious: we have limited ourselves to three of the numerous ways of comparing nouns in terms of their SRW. Other types of SRW that suggest themselves are: verbs, where TW is subject; verbs, where TW is direct object; prepositional phrases as dependents, or governors, of TW; nouns joined to TW through coordinate conjunctions (i.e., &amp;quot;apples&amp;quot; and &amp;quot;grapes&amp;quot; are said to be more similar if &amp;quot;apples and oranges&amp;quot; and &amp;quot;grapes and oranges&amp;quot; occur in text). Some of the holes in our tea strainer are: the neglect of the case of the noun dependent of TW, or the</Paragraph>
    <Paragraph position="10"> case of the TW when the SRW is a noun governor; the neglect of technical symbols in physical text, as dependent or governor of the TW; the failure to distinguish between different functions of governors or dependents in a noun/noun pair (e.g., the distinction between &amp;quot;subjective&amp;quot; and &amp;quot;objective&amp;quot; genitive); the neglect of transformationally equivalent constructions. In view of these deficiencies (not to mention the problem of statistics), the success of our fishing expedition is open to doubt. Let us then proceed to examine the catch.</Paragraph>
  </Section>
  <Section position="4" start_page="22" end_page="22" type="metho">
    <SectionTitle>
3. RESULTS
</SectionTitle>
    <Paragraph position="0"> The evaluation of the data contained in our three machine listings is not an easy task. We can scarcely examine and discuss the degrees of similarity of 780 noun-pairs. The problem of interpretation is also complicated: how completely and accurately should the results correspond with our expectations, as represented in the tentative semantic groupings (Table 1)? Our approach is to deal in a summary manner with the noun-pairs characterized by highest Similarity Coefficients, especially with respect to their intra- and inter-group relationships. Before proceeding to this discussion, a few preliminary remarks should be made about the data in the various machine listings.</Paragraph>
    <Paragraph position="1"> The summary of SRW counts for each TW, contained in Table 1, suggests that not all TWs have the same opportunity for comparison. In the case of &amp;quot;correspondence&amp;quot; (Group 3), a total of only three SRW is noted (Column L4); as a result, this TW should be eliminated from further consideration. In addition, unless at least two, and preferably all three, types of SRW are well represented for a given TW, the SC for that noun will tend to be skewed. As examples, we note all nouns in Group 6 (for which the L3 column predominates), and the nouns in Group 10 (for which the L2 column predominates). In effect, these nouns are &amp;quot;deficient&amp;quot; in certain types of SRW, and require special handling.</Paragraph>
    <Paragraph position="2"> On the printout, &amp;quot;Ranking of TW-Pairs by SC,&amp;quot; a number of noun pairs appear at the top end of the scale although the total number of shared SRW is small (i.e., the value of column &amp;quot;NA&amp;quot; (see Table 3) is &amp;quot;1,&amp;quot; &amp;quot;2,&amp;quot; or &amp;quot;3&amp;quot;). The SC may be high, because the product of the frequencies is relatively low. Our policy has been to discount these pairs on the grounds that the value of &amp;quot;NA&amp;quot; is significant in determining the similarity between two TWs. The minimum value for NA was arbitrarily set at four.</Paragraph>
    <Paragraph position="3"> Keeping these emendations to the data in mind, we proceed to the discussion of the noun-pairs characterized by highest SC. Table 4 shows the distribution of SC by noun-pairs. By any standard, the data shows negative or extremely weak similarity for most of the 780 pairs.</Paragraph>
    <Paragraph position="5"> An abstract of a paper on the proclivity of nouns to enter into certain combinations is cited in Reference 3.</Paragraph>
    <Paragraph position="6"> At which point on the curve shall we draw a line, saying that an SC above this value indicates similarity, and that an SC below this value indicates dissimilarity or weak similarity (all this, of course, in terms of reliability)? For purposes of discussion, we propose to set the threshold at .00100, a rigorously high figure. After eliminating pairs whose NA value is less than 4, we find 38 pairs whose SC lies in the range .00100 to .01~337 (Table 5). (The first two zeroes are dropped.) The reader may draw his own conclusions about the degree of similarity between the nouns in any given pairing. For purposes of discussion, we will refer to the pairings in terms of our preliminary groupings (Table 1). The following intra- and inter-Group pairings are observed in Table 5:</Paragraph>
    <Paragraph position="8"> We note that no pairings appear for nouns of Groups 3 and 8. All other groups except Group 4 are represented by intra-group pairings; to this degree, our expectations are fulfilled, i.e., the data supports our a priori feelings for the similarity between words. The amount of inter-group pairing may indicate either that the data is inconclusive, or that our original groupings were too narrow. In fact, two larger groups emerge: one composed of Groups 1 and 2 (perhaps including Group 10), the other composed of Groups 4, 5, 6, and 7. This tendency is more marked if we lower the SC threshold from .00100 to .00070, thereby adding a total of 28 pairs to the number listed in Table 5. For example, nouns of Group 1 are found to pair with those of Group 10, and nouns of Group 4 pair with those of Groups 6 and 7.</Paragraph>
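    <Paragraph> The selection rule applied above (discard pairs with fewer than four shared SRW, then keep those at or above an SC threshold) can be sketched as follows. The pair labels and the NA and SC values are invented placeholders, not figures from Table 5, and the function name is an assumption made for the illustration.

```python
# Hypothetical rows: (test_word_a, test_word_b, NA, SC). All values are invented.
PAIRS = [
    ("noun_01", "noun_02", 9, 0.00240),
    ("noun_03", "noun_04", 4, 0.00105),
    ("noun_05", "noun_06", 6, 0.00082),
    ("noun_07", "noun_08", 2, 0.00310),  # high SC, but too few shared SRW
    ("noun_09", "noun_10", 5, 0.00031),
]

MIN_NA = 4  # minimum number of shared SRW, set arbitrarily at four in the text


def select_pairs(pairs, sc_threshold, min_na=MIN_NA):
    """Keep pairs with enough shared SRW and an SC at or above the threshold,
    ranked by descending SC."""
    kept = [p for p in pairs if p[2] >= min_na and p[3] >= sc_threshold]
    return sorted(kept, key=lambda p: p[3], reverse=True)


strict = select_pairs(PAIRS, 0.00100)   # the threshold used for Table 5
relaxed = select_pairs(PAIRS, 0.00070)  # the lowered threshold
added = [p for p in relaxed if p not in strict]

print(len(strict), "pairs at .00100;", len(added), "more at .00070")
```

    In this toy data the pair with the highest SC is discarded because its NA is only 2, which is the situation the NA minimum is meant to handle.</Paragraph>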
    <Paragraph position="9"> The data is not statistically conclusive, but strongly suggests the emergence of the two major groups mentioned above. The amalgamation of Groups 1 and 2 can easily be defended on semantic grounds; since Group 10, as noted above, is subject to aberrant behavior (because of the very high number of noun dependents), its inter-relation with Groups 1 and 2 may not be taken seriously. Groups 4, 5, 6, and 7, which include the names of chemical mixtures, classes of elements, individual elements, and components of elements, may be taken together semantically as a single sub-class of &amp;quot;object nouns.&amp;quot; The physicist tends to say the same things about all nouns in this group.</Paragraph>
    <Paragraph position="10"> One of the 38 pairs listed in Table 5 appears to contradict expectation: &amp;quot;liquid&amp;quot;/&amp;quot;problem&amp;quot; (Groups 5 and 11). It should also be noted that the noun dependents of Group 10 nouns serve a &amp;quot;subjective&amp;quot; rather than &amp;quot;objective&amp;quot; function. If we had distinguished between the syntactic function of the noun dependent, TWs of Group 10 would be only weakly similar to TWs of Groups 1 and 2.</Paragraph>
    <Paragraph position="11"> The four SRW shared by those two nouns include the adjective &amp;quot;certain&amp;quot; and the noun governor &amp;quot;number.&amp;quot; The non-discriminatory (&amp;quot;promiscuous&amp;quot;) nature of these two SRW is perhaps obvious, and one of the refinements that should be introduced in future studies is the neglect of such words as &amp;quot;significant&amp;quot; SRW. (The study of &amp;quot;promiscuity&amp;quot; in adjectives is referred to in Reference 4.) At the present, experience suggests that distortions introduced by such words are minimal if the number of SRW is sufficiently large. Our general conclusion is that, with a few anomalies, the 66 pairings for which the SC is .00070 or higher meet with our expectations.</Paragraph>
    <Paragraph position="12"> Another aspect of the question remains: many nouns with presumed similarity are not represented on the high end of the SC distribution curve. (If we lower the threshold to include such pairs we shall also encounter many non-similar pairs.) One way of dealing with this problem is to consider the most highly correlated pairs that nouns in each Group form, whether or not the SC is &amp;quot;significantly&amp;quot; high. In lieu of presenting this information in full detail, we show in Table 6 the most closely correlated pairs for a representative noun from each of the Groups (excepting Groups 3, 4, and 8).</Paragraph>
    <Paragraph position="13"> The most striking aspect of Table 6 is the repetition of intra- and inter-Group pairings noted in Table 5 for high-SC pairings. In other words, the relative value of</Paragraph>
    <Paragraph position="15"> the SC appears to be as significant as the absolute value.</Paragraph>
    <Paragraph position="16"> This result was certainly not expected, and perhaps indicates a greater sensitivity in our measurement procedures than we would have thought reasonable.</Paragraph>
    <Paragraph position="17"> Table 6 suggests, but does not prove, the existence of clusters (or &amp;quot;clumps&amp;quot;) of TWs, in which the members are closely correlated with each other, and in which no member is closely correlated to any outside word. We have not yet attempted to apply clumping procedures; a better understanding of the data is perhaps a prerequisite to this rigorous treatment. For the present, we shall point out a phenomenon that strongly suggests the existence of clumps: the recurrence of the same SRW among several TWs with high mutual correlation. Consider, for example, that a high SC is found between Test Words A and B, B and C, and A and C; if, in addition, a relatively high proportion of SRW are shared by all three TWs, the mutual connection of the three words would appear to be considerably strengthened. The recurrence of SRW has not been systematically studied, but the following sample is offered as an illustration of the phenomenon. Below, we list all the SRW of the three types, for the TW calculation 1. The underlined words are those which, in addition, also served as corresponding SRW for two other TWs (determination and measurement) that are highly correlated to each other and to calculation 1.</Paragraph>
    <Paragraph position="18"> [Table 7: SRW of the TW calculation 1. Only two glosses of the listing survive: (possibility); (method).]</Paragraph>
    <Paragraph position="19"> Table 7 shows that eighteen SRW appeared for calculation 1. Of these, one half (nine) also appeared as SRW for both determination and measurement. It would seem that the &amp;quot;togetherness&amp;quot; of these three TWs is strengthened by this feature, which we term &amp;quot;recurrence of SRW.&amp;quot; We have no ready formula for determining that recurrence is or is not significant in a given situation. In general, the nature and behavior of individual SRW remain to be studied, so far as their relevance to our problem is concerned.</Paragraph>
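    <Paragraph> The notion of recurrence can be illustrated with a small sketch. No formula for it is given in the text, so the proportion computed here is only one assumed way of expressing it. The word numbers are invented and merely chosen so that nine of eighteen SRW recur, as in Table 7; for brevity the three SRW types are pooled into one set per Test Word, whereas the text compares corresponding types.

```python
def recurrence(target, others):
    """Return the SRW of `target` that recur in every set in `others`,
    together with their share of all SRW of `target`."""
    recurring = set(target).intersection(*others)
    return recurring, len(recurring) / len(target)


# Hypothetical SRW word numbers for three Test Words.
calculation = set(range(1, 19))                # 18 distinct SRW
determination = set(range(1, 10)) | {40, 41}   # shares SRW 1 to 9
measurement = set(range(1, 10)) | {12, 50}     # shares SRW 1 to 9, plus 12

recurring, share = recurrence(calculation, [determination, measurement])
print(len(recurring), "of", len(calculation), "SRW recur; share =", share)
```

    SRW 12 is shared by calculation and measurement only, so it does not count as recurring across all three Test Words.</Paragraph>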
  </Section>
</Paper>