<?xml version="1.0" standalone="yes"?> <Paper uid="P90-1034"> <Title>NOUN CLASSIFICATION FROM PREDICATE-ARGUMENT STRUCTURES</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> NOUN CLASSIFICATION FROM PREDICATE-ARGUMENT STRUCTURES Donald Hindle AT&amp;T Bell Laboratories </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> A method of determining the similarity of nouns on the basis of a metric derived from the distribution of subject, verb and object in a large text corpus is described. The resulting quasi-semantic classification of nouns demonstrates the plausibility of the distributional hypothesis, and has potential application to a variety of tasks, including automatic indexing, resolving nominal compounds, and determining the scope of modification.</Paragraph> </Section> <Section position="3" start_page="0" end_page="268" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> A variety of linguistic relations apply to sets of semantically similar words. For example, modifiers select semantically similar nouns, selectional restrictions are expressed in terms of the semantic class of objects, and semantic type restricts the possibilities for noun compounding. Therefore, it is useful to have a classification of words into semantically similar sets. Standard approaches to classifying nouns, in terms of an &quot;is-a&quot; hierarchy, have proven hard to apply to unrestricted language.</Paragraph> <Paragraph position="1"> Is-a hierarchies are expensive to acquire by hand for anything but highly restricted domains, while attempts to automatically derive these hierarchies from existing dictionaries have been only partially successful (Chodorow, Byrd, and Heidorn 1985).</Paragraph> <Paragraph position="2"> This paper describes an approach to classifying English words according to the predicate-argument structures they show in a corpus of text. The general idea is straightforward: in any natural language there are restrictions on what words can appear together in the same construction, and in particular, on what can be arguments of what predicates. For each noun, there is a restricted set of verbs that it appears as the subject or object of. For example, wine may be drunk, produced, and sold but not pruned. Each noun may therefore be characterized according to the verbs that it occurs with. Nouns may then be grouped according to the extent to which they appear in similar environments.</Paragraph> <Paragraph position="3"> This basic idea of the distributional foundation of meaning is not new. Harris (1968) makes this &quot;distributional hypothesis&quot; central to his linguistic theory. His claim is that &quot;the meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities&quot; (Harris 1968:12). Sparck Jones (1986) takes a similar view.</Paragraph> <Paragraph position="4"> It is, however, by no means obvious that the distribution of words will directly provide a useful semantic classification, at least in the absence of considerable human intervention.
The work that has been done based on Harris' distributional hypothesis (most notably, the work of the associates of the Linguistic String Project; see for example Hirschman, Grishman, and Sager 1975) unfortunately does not provide a direct answer, since the corpora used have been small (tens of thousands of words rather than millions) and the analysis has typically involved considerable intervention by the researchers. The stumbling block to any automatic use of distributional patterns has been that no sufficiently robust syntactic analyzer has been available.</Paragraph> <Paragraph position="5"> This paper reports an investigation of automatic distributional classification of words in English, using a parser developed for extracting grammatical structures from unrestricted text (Hindle 1983). We propose a particular measure of similarity that is a function of mutual information estimated from text.</Paragraph> <Paragraph position="6"> On the basis of a six million word sample of Associated Press news stories, a classification of nouns was developed according to the predicates they occur with. This purely syntax-based similarity measure shows remarkably plausible semantic relations.</Paragraph> </Section> <Section position="4" start_page="268" end_page="269" type="metho"> <SectionTitle> 2. ANALYZING THE CORPUS </SectionTitle> <Paragraph position="0"> A 6 million word sample of Associated Press news stories was analyzed, one sentence at a time, by a deterministic parser (Fidditch) of the sort originated by Marcus (1980). Fidditch provides a single syntactic analysis -- a tree or sequence of trees -- for each sentence; Figure 1 shows part of the output for sentence (1).</Paragraph> <Paragraph position="1"> (1) The clothes we wear, the food we eat, the air we breathe, the water we drink, the land that sustains us, and many of the products we use are the result of agricultural research. (March 22 1987)</Paragraph> <Paragraph position="2"> The parser aims to be non-committal when it is unsure of an analysis. For example, it is perfectly willing to parse an embedded clause and then leave it unattached. If the object or subject of a clause is not found, Fidditch leaves it empty, as in the last two clauses in Figure 1. This non-committal approach simply reduces the effective size of the sample.</Paragraph> <Paragraph position="3"> The aim of the parser is to produce an annotated surface structure, building constituents as large as it can, and reconstructing the underlying clause structure when it can. In sentence (1), six clauses are found. Their predicate-argument information may be coded as a table of 5-tuples, consisting of verb, surface subject, surface object, underlying subject, and underlying object, as shown in Table 1. In the subject-verb-object table, the root form of the head of each phrase is recorded, and the deep subject and object are used when available. (Noun phrases of the form a n1 of n2 are coded as n1 n2; an example is the first entry in Table 2.)</Paragraph> <Paragraph position="4"> The parser's analysis of sentence (1) is far from perfect: the object of wear is not found, the object of use is not found, and the single element land rather than the conjunction of clothes, food, air, water, land, products is taken to be the subject of be. Despite these errors, the analysis succeeds in discovering a number of the correct predicate-argument relations. The parsing errors that do occur seem to result, for the current purposes, in the omission of predicate-argument relations, rather than their misidentification. This makes the sample less effective than it might be, but it is not in general misleading. (It may also skew the sample to the extent that the parsing errors are consistent.) The analysis of the 6 million word 1987 AP sample yields 4789 verbs in 274613 clausal structures, and 26742 head nouns. This table of predicate-argument relations is the basis of our similarity metric.</Paragraph>
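To make the data structure concrete, the following is a minimal sketch (not the paper's own code) of the clause table described above; the row values are reconstructed from the prose discussion of sentence (1) rather than copied from Table 1, and the heads function simply prefers the underlying arguments when the parser has recovered them.

```python
from collections import namedtuple

# One row per clause: the verb plus surface and underlying (deep) arguments,
# each reduced to the root form of the phrase's head.  Empty strings mark
# arguments the parser left unfilled.
Clause = namedtuple("Clause", "verb surf_subj surf_obj deep_subj deep_obj")

# Illustrative rows for sentence (1), reconstructed from the prose.
clauses = [
    Clause("wear", "we", "", "we", ""),          # object of wear not found
    Clause("drink", "we", "", "we", "water"),
    Clause("sustain", "that", "us", "land", "us"),
    # "the result of agricultural research" coded as the head pair "result research"
    Clause("be", "land", "result research", "land", "result research"),
]

def heads(clause):
    """Prefer the deep subject and object when the parser recovered them."""
    return (clause.deep_subj or clause.surf_subj,
            clause.deep_obj or clause.surf_obj)

for c in clauses:
    subj, obj = heads(c)
    print(f"{c.verb:8s} subj={subj or '-':8s} obj={obj or '-'}")
```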
</Section> <Section position="5" start_page="269" end_page="270" type="metho"> <SectionTitle> 3. TYPICAL ARGUMENTS </SectionTitle> <Paragraph position="0"> For any verb in the sample, we can ask what nouns it has as subjects or objects. Table 2 shows the objects of the verb drink that occur (more than once) in the sample, in effect giving the answer to the question &quot;what can you drink?&quot; This list of drinkable things is intuitively quite good. The objects in Table 2 are ranked not by raw frequency, but by a cooccurrence score listed in the last column. The idea is that, in ranking the importance of noun-verb associations, we are interested not in the raw frequency of cooccurrence of a predicate and argument, but in their frequency normalized by what we would expect. More is to be learned from the fact that you can drink wine than from the fact that you can drink it, even though there are more clauses in our sample with it as an object of drink than with wine. To capture this intuition, we turn, following Church and Hanks (1989), to &quot;mutual information&quot; (see Fano 1961). The mutual information of two events I(x, y) is defined as follows:</Paragraph> <Paragraph position="1"> I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]</Paragraph> <Paragraph position="2"> where P(x, y) is the joint probability of events x and y, and P(x) and P(y) are the respective independent probabilities. When the joint probability P(x, y) is high relative to the product of the independent probabilities, I is positive; when the joint probability is relatively low, I is negative. We use the observed frequencies to derive a cooccurrence score Cobj (an estimate of mutual information) defined as follows:</Paragraph> <Paragraph position="3"> Cobj(n, v) = log2 [ (f(n, v) / N) / ((f(n) / N) (f(v) / N)) ]</Paragraph> <Paragraph position="4"> where f(n, v) is the frequency of noun n occurring as object of verb v, f(n) is the frequency of the noun n occurring as argument of any verb, f(v) is the frequency of the verb v, and N is the count of clauses in the sample. (Csubj(n, v) is defined analogously.) Calculating the cooccurrence weight for drink, shown in the third column of Table 2, gives us a reasonable ranking of terms, with it near the bottom.</Paragraph> <Section position="1" start_page="270" end_page="270" type="sub_section"> <SectionTitle> Multiple Relationships </SectionTitle> <Paragraph position="0"> For any two nouns in the sample, we can ask what verb contexts they share. The distributional hypothesis is that nouns are similar to the extent that they share contexts. For example, Table 3 shows all the verbs which wine and beer can be objects of, highlighting the three verbs they have in common. The verb drink is the key common factor. There are of course many other objects that can be sold, but most of them are less like wine and beer because they can't also be drunk. So, for example, a car is an object that you can have and sell, like wine and beer, but you do not -- in this sample (confirming what we know from the meanings of the words) -- typically drink a car.</Paragraph>
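As a minimal sketch of this scoring (not code from the paper, and with invented counts purely to exercise the formula), the cooccurrence score can be computed from the clause counts as follows; with these toy numbers wine outranks the pronoun it as an object of drink even though it is the more frequent object, which is the behaviour the normalization is meant to produce.

```python
import math
from collections import Counter

# (verb, object-head) pairs as they would be read off the clause table;
# the counts below are invented toy values, not figures from the AP sample.
pairs = (
    [("drink", "wine")] * 3 + [("drink", "it")] * 5 +
    [("sell", "wine")] * 2 + [("sell", "it")] * 20 +
    [("produce", "wine")] * 2
)

N = 1000                               # pretend total number of clauses
f_pair = Counter(pairs)                # f(n, v): noun n as object of verb v
f_verb = Counter(v for v, n in pairs)  # f(v)
f_noun = Counter(n for v, n in pairs)  # f(n): noun as argument of any verb

def c_obj(noun, verb):
    """Cobj(n, v) = log2( (f(n, v) / N) / ((f(n) / N) * (f(v) / N)) )."""
    joint = f_pair[(verb, noun)] / N
    indep = (f_noun[noun] / N) * (f_verb[verb] / N)
    return math.log2(joint / indep) if joint > 0 else float("-inf")

for noun in ("wine", "it"):
    print(f"Cobj({noun}, drink) = {c_obj(noun, 'drink'):.2f}")
```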
</Section> </Section> <Section position="6" start_page="270" end_page="274" type="metho"> <SectionTitle> 4. NOUN SIMILARITY </SectionTitle> <Paragraph position="0"> We propose the following metric of similarity, based on the mutual information of verbs and arguments. Each noun has a set of verbs that it occurs with (either as subject or object), and for each such relationship, there is a mutual information value. For each noun and verb pair, we get two mutual information values, for subject and object, Csubj(vi, nj) and Cobj(vi, nj). We define the object similarity of two nouns with respect to a verb in terms of the minimum shared cooccurrence weights, as in (2).</Paragraph> <Paragraph position="1"> The subject similarity of two nouns, SIMsubj, is defined analogously.</Paragraph> <Paragraph position="2"> Now we define the overall similarity of two nouns as the sum across all verbs of the object similarity and the subject similarity, as in (3).</Paragraph> <Paragraph position="3"> (2) Object similarity. SIMobj(vi, n1, n2) = min(Cobj(vi, n1), Cobj(vi, n2)) if both weights are positive; |max(Cobj(vi, n1), Cobj(vi, n2))| if both weights are negative; and 0 otherwise.</Paragraph> <Paragraph position="4"> (3) Noun similarity. SIM(n1, n2) = sum over all verbs vi of SIMsubj(vi, n1, n2) + SIMobj(vi, n1, n2).</Paragraph> <Paragraph position="5"> The metric of similarity in (2) and (3) is but one of many that might be explored, but it has some useful properties. Unlike an inner product measure, it is guaranteed that a noun will be most similar to itself. And unlike cosine distance, this metric is roughly proportional to the number of different verb contexts that are shared by two nouns.</Paragraph> <Paragraph position="6"> Using the definition of similarity in (3), we can begin to explore nouns that show the greatest similarity. Table 4 shows the ten nouns most similar to boat, according to our similarity metric. The first column lists the noun which is similar to boat. The second column in each table shows the number of instances that the noun appears in a predicate-argument pair (including verb environments not in the list in the fifth column). The third column is the number of distinct verb environments (either subject or object) that the noun occurs in which are shared with the target noun of the table.</Paragraph> <Paragraph position="7"> Thus, boat is found in 79 verb environments. Of these, ship shares 25 common environments (ship also occurs in many other unshared environments). The fourth column is the measure of similarity of the noun with the target noun of the table, SIM(n1, n2), as defined above.</Paragraph> <Paragraph position="8"> The fifth column shows the common verb environments, ordered by cooccurrence score, C(vi, nj), as defined above. An underscore before the verb indicates that it is a subject environment; a following underscore indicates an object environment. In Table 4, we see that boat is a subject of cruise, and object of sink. In the list for boat, in column five, cruise appears earlier in the list than carry because cruise has a higher cooccurrence score. A minus sign before a verb means that the cooccurrence score is negative, i.e., the noun is less likely to occur in that argument context than expected.</Paragraph> <Paragraph position="9"> For many nouns, encouragingly appropriate sets of semantically similar nouns are found. Thus, of the ten nouns most similar to boat, the most similar noun is the near-synonym ship.</Paragraph>
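A minimal sketch of the computation in (2) and (3), assuming the case analysis reconstructed in (2) above; the cooccurrence weights here are invented toy values rather than scores from the AP sample.

```python
def sim_shared(c1, c2):
    """Similarity contribution of one verb context shared by two nouns, per (2):
    the smaller weight when both are positive, the absolute value of the larger
    (less negative) weight when both are negative, and zero otherwise."""
    if c1 > 0 and c2 > 0:
        return min(c1, c2)
    if c1 < 0 and c2 < 0:
        return abs(max(c1, c2))
    return 0.0

def sim(n1, n2, c_subj, c_obj, verbs):
    """Noun similarity per (3): sum subject and object contributions over all verbs.
    c_subj and c_obj map (verb, noun) pairs to cooccurrence weights."""
    total = 0.0
    for v in verbs:
        total += sim_shared(c_subj.get((v, n1), 0.0), c_subj.get((v, n2), 0.0))
        total += sim_shared(c_obj.get((v, n1), 0.0), c_obj.get((v, n2), 0.0))
    return total

# Toy weights: boat and ship share the object contexts of sink and charter
# and the subject context of cruise; drink is not a shared context.
c_obj_w = {("sink", "boat"): 5.0, ("sink", "ship"): 4.2,
           ("charter", "boat"): 3.1, ("charter", "ship"): 3.8,
           ("drink", "wine"): 5.7}
c_subj_w = {("cruise", "boat"): 4.5, ("cruise", "ship"): 4.0}

verbs = {"sink", "charter", "cruise", "drink"}
print(sim("boat", "ship", c_subj_w, c_obj_w, verbs))  # about 11.3 = 4.2 + 3.1 + 4.0
```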
<Paragraph position="10"> The ten nouns most similar to treaty (agreement, plan, constitution, contract, proposal, accord, amendment, rule, law, legislation) seem to make up a cluster involving the notions of agreement and rule. Table 5 shows the ten nouns most similar to legislator, again a fairly coherent set. Of course, not all nouns fall into such neat clusters: Table 6 shows a quite heterogeneous group of nouns similar to table, though even here the most similar word (floor) is plausible. We need, in further work, to explore both automatic and supervised means of discriminating the semantically relevant associations from the spurious.</Paragraph> <Section position="1" start_page="271" end_page="272" type="sub_section"> <SectionTitle> Verbs </SectionTitle> <Paragraph position="0"> (Fifth column of Table 4: the verb environments shared with boat, ordered by cooccurrence score; subject contexts are marked by a preceding underscore, object contexts by a following one.) </Paragraph> </Section> <Section position="2" start_page="272" end_page="272" type="sub_section"> <SectionTitle> Verbs </SectionTitle> <Paragraph position="0"> (Fifth column of Table 5: the verb environments shared with legislator, ordered by cooccurrence score.) </Paragraph> </Section>
<Section position="3" start_page="272" end_page="274" type="sub_section"> <SectionTitle> Verbs </SectionTitle> <Paragraph position="0"> (Fifth column of Table 6: the verb environments shared with table, ordered by cooccurrence score.) </Paragraph> </Section> <Section position="4" start_page="272" end_page="274" type="sub_section"> <SectionTitle> Reciprocally most similar nouns </SectionTitle> <Paragraph position="0"> We can define &quot;reciprocally most similar&quot; nouns or &quot;reciprocal nearest neighbors&quot; (RNN) as two nouns which are each other's most similar noun. This is a rather stringent definition; under this definition, boat and ship do not qualify because, while ship is the most similar to boat, the word most similar to ship is not boat but plane (boat is second). For a sample of all the 319 nouns of frequency greater than 100 and less than 200, we asked whether each has a reciprocally most similar noun in the sample. For this sample, 36 had a reciprocal nearest neighbor. These are shown in Table 7 (duplicates are shown only once).</Paragraph> <Paragraph position="1"> The list in Table 7 shows quite a good set of substitutable words, many of which are near synonyms. Some are not synonyms but are nevertheless closely related: economist - analyst, 2 - 3. Some we recognize as synonyms in news reporting style: explosion - blast, bomb - device, tie - relation. And some are hard to interpret. Is the close relation between star and editor some reflection of news reporters' world view? Is list most like field because neither one has much meaning by itself?</Paragraph>
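A small sketch of the reciprocal-nearest-neighbor test, written against any similarity function of the kind defined in (3); the similarity table below is a toy stand-in, not values computed from the AP sample.

```python
def nearest(noun, nouns, sim):
    """The noun (other than noun itself) with the highest similarity score."""
    return max((m for m in nouns if m != noun), key=lambda m: sim(noun, m))

def reciprocal_pairs(nouns, sim):
    """Pairs (a, b) such that a's nearest neighbor is b and b's nearest is a."""
    pairs = set()
    for a in nouns:
        b = nearest(a, nouns, sim)
        if nearest(b, nouns, sim) == a:
            pairs.add(tuple(sorted((a, b))))
    return pairs

# Toy similarity table: ship's nearest neighbor is plane rather than boat,
# so boat/ship is not a reciprocal pair, while ruling/decision is.
toy = {frozenset(p): s for p, s in [
    (("boat", "ship"), 11.3), (("ship", "plane"), 12.0), (("boat", "plane"), 9.0),
    (("ruling", "decision"), 8.5), (("ruling", "boat"), 0.5), (("decision", "ship"), 0.7),
]}

def sim(a, b):
    return toy.get(frozenset((a, b)), 0.0)

nouns = ["boat", "ship", "plane", "ruling", "decision"]
# two reciprocal pairs: ('decision', 'ruling') and ('plane', 'ship')
print(reciprocal_pairs(nouns, sim))
```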
</Section> </Section> </Paper>