<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2605"> <Title>Using Selectional Profile Distance to Detect Verb Alternations</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Detecting Verb Alternations </SectionTitle> <Paragraph position="0"> Although patterns of verb alternations, as in (1) and (2), may appear to be &quot;mere&quot; syntactic variation, the ability of a verb to alternate has been shown to be highly related to its semantic properties.</Paragraph> <Paragraph position="1"> 1. The sun melted the snow./The snow melted.</Paragraph> <Paragraph position="2"> 2. Kiva ate his lunch./Kiva ate./*His lunch ate. For example, melt in (1) undergoes a causative alternation in which the transitive form is related to the intransitive by the introduction of a Causal Agent (the sun) into the event structure. The verb eat in (2), like melt, allows both transitive and intransitive forms, but these are related by the unspecified object alternation, as opposed to causativization.</Paragraph> <Paragraph position="3"> Based largely on the influence of Levin (1993), it has become widely accepted that alternations such as these can serve as a basis for the formation of semantic classes of verbs. Correspondingly, the relation between alternation patterns and meaning is a key focus in the computational study of the lexical semantics of verbs (e.g., Allen, 1997; Dang et al., 2000; Dorr and Jones, 2000; Merlo and Stevenson, 2001; Schulte im Walde and Brew, 2002; Tsang et al., 2002). 
Furthermore, we note that recent work indicates that verb alternations may also play a role in automatic processing of language for applied tasks, such as question-answering (Katz et al., 2001), detection of text relations (Teufel, 1999), and determination of verb-particle constructions (Bannard, 2002).</Paragraph> <Paragraph position="4"> The theoretical and practical implications of alternations mean that it is important to identify verbs which undergo an alternation, and to discover the range of alternations. Manual annotation of verbs is labour intensive, and new verbs (or new uses of known verbs) may be encountered in any given domain. In response, some researchers have begun to investigate ways to detect alternations automatically in a corpus. Some of this work has focused on subcategorization patterns as the clear syntactic cue to an alternation (Lapata, 1999; Lapata and Brew, 1999; Schulte im Walde and Brew, 2002). Other work has observed, however, that detecting an alternation involves more than observing the use of particular subcategorizations--it must also be determined whether the semantic arguments are mapped to the appropriate positions. (For example, melt (as in (1) above) undergoes a causative alternation because the Theme argument that surfaces as subject of the intransitive surfaces as object of the transitive, with the addition of a Causal Agent as the subject of the latter. It is not the case that any optionally intransitive verb undergoes this alternation, as shown by eat in (2).)</Paragraph> <Paragraph position="5"> To address this issue, it has been suggested that, if a verb participates in an alternation, then there should be similarity in the kinds of nouns that show up in the syntactic positions (or slots) that alternate--such as snow occurring as intransitive subject and transitive object in the causative alternation in (1) (Merlo and Stevenson, 2001; McCarthy, 2000).
As a cue to this alternation, Merlo and Stevenson (2001) create a bag of head nouns for each of the two potentially alternating slots, and compare them.</Paragraph> <Paragraph position="6"> In contrast to comparing head nouns directly, McCarthy (2000) instead compares the selectional preferences for each of the two slots (captured by a probability distribution over WordNet). This approach thereby generalizes over the compared nouns, increasing performance over a method similar to that of Merlo and Stevenson.</Paragraph> <Paragraph position="7"> In our work, we have developed a new method for comparing WordNet probability distributions, called &quot;selectional profile distance&quot; (SPD), which combines the benefits of each of the above approaches for detecting alternations. The method used by Merlo and Stevenson (2001) has the advantage of directly capturing similarity between slots (in terms of use of identical nouns [lemmas]), but fails to generalize over the nouns, lending itself to sparse data problems. The approach of McCarthy (2000), on the other hand, addresses the generalization problem by comparing probability distributions over WordNet. However, her comparison measure abstracts over distances between nodes (classes of nouns) in WordNet: it rewards probability mass that occurs in the same subtree across two distributions, but does not take into account the distance between the classes that carry the probability mass. Thus, this approach only captures similarity among the noun arguments across slots at a very coarse level. Our new SPD method integrates a comparison of probability distributions over WordNet with a node similarity measure, successfully capturing both of the advantageous properties of generalization and word (class) similarity.
SPD thus enables us to calculate a meaningful similarity measure over the patterns of classes of nouns across two syntactic slots.</Paragraph> <Paragraph position="8"> Our evaluation of the SPD measure for alternation detection also covers some interesting experimental conditions that have not been explored previously. For comparison to previous methods, we investigate these issues in the context of classifying verbs according to whether they undergo the causative alternation. We experiment with randomly selected verbs, for both our alternating and non-alternating (filler) classes, and use both relatively homogeneous and heterogeneous sets of filler verbs. We find that our method performs about the same on each set, indicating that it is insensitive to variation in the filler verbs. Moreover, we experiment with equal numbers of verbs in different frequency bands, and show that splitting verbs into high and low frequency (of slot occurrence) can improve performance. By classifying the high and low frequency verbs separately, our method achieves an accuracy of 70% overall on unseen test verbs, in a task with a baseline of 50%. (For comparison, McCarthy (2000) achieves 73% on her set of hand-selected verbs, but our implementation of her method yields much lower performance on our randomly selected test verbs.) In the next section, we present background work on capturing selectional preferences in WordNet, and on using them to detect alternations. In Section 3, we describe our new SPD measure, and show how it captures both the general differences between WordNet probability distributions, as well as the fine-grained semantic distances between the nodes that comprise them. Section 4 presents our corpus methodology and experimental set-up. In Section 5, we compare SPD to a range of distance measures, and evaluate the different effects of our experimental factors, such as the precise distance functions we use in SPD and the division of our verbs into frequency bands. 
We summarize our findings in Section 6 and point to directions in our on-going work.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Use of Selectional Preferences </SectionTitle> <Paragraph position="0"> Selectional preference refers to the general notion of how much a verb favours (or disfavours) a particular noun as a semantic argument. For example, informally we would say that eat has a strong selectional preference for nouns of type food as its Theme argument. Formalization of this notion has been difficult, but several computational methods have now been proposed that capture the selectional preference of a verb as a probability distribution over the WordNet hierarchy (Resnik, 1993; Li and Abe, 1998; Clark and Weir, 2002). (Resnik's proposed measure is not actually a probability distribution, but a difference between probability distributions.) The key task that each of these proposals addresses is how to generalize appropriately from counts of observed nouns in the relevant verb argument position (in a corpus) to a probabilistic representation of selectional strength over classes. We will refer in the remainder of the paper to such a probability distribution over WordNet as a &quot;selectional profile.&quot; As mentioned above, McCarthy (2000) suggested the use of selectional profiles to capture generalizations over argument slots, so that two argument slots could be effectively compared for detecting alternations. After extracting the argument heads of the target slots of each verb (e.g., the intransitive subject and the transitive object for the causative alternation), she then determined their selectional profiles using a minimum description length tree cut model (Li and Abe, 1998). (A tree cut for tree T is a set of nodes C in T such that every leaf node of T has exactly one member of C on the path between it and the root. As a selectional profile, a tree cut has a non-zero probability associated with every node in C, and a zero probability for all other nodes in T.)</Paragraph> <Paragraph position="1"> (Figure 1: Two example tree cuts, with Profile1 in square boxes and Profile2 in ovals. Probability values of zero are not shown.)</Paragraph> <Paragraph position="2"> The two slot profiles were compared using skew divergence (a variant of KL divergence, proposed by Lee, 2001) as a probability distance measure. The value of the distance measure was compared to a threshold, which determined the classification of a verb as causative (the two profiles were similar) or non-causative (the two profiles were dissimilar), leading to a best performance of 73% accuracy.</Paragraph> <Paragraph position="3"> In McCarthy (2000), an error analysis reveals that the best method has more false positives than false negatives--some slots are considered overly similar because the selectional profiles are compared at a coarse-grained level, losing fine semantic distinctions.</Paragraph> <Paragraph position="4"> In the next section, we propose an alternative method of comparing selectional profiles, which addresses the problem of insufficient discrimination of profile similarity in WordNet. Furthermore, the approach applies generally to any probability distribution over WordNet, unlike McCarthy's method, which is specific to profiles that are tree cuts.</Paragraph> </Section> <Section position="4" start_page="0" end_page="5" type="metho"> <SectionTitle> 3 Selectional Profile Distance </SectionTitle> <Paragraph position="0"> Our measure of selectional profile distance (SPD) is designed to meet two criteria. First, it should allow easy comparison between selectional profiles as probability scores spread throughout a hierarchical ontology (such as WordNet), not just between tree cuts.
Second, it should capture fine-grained semantic similarity between profiles.</Paragraph> <Paragraph position="1"> To achieve these two goals, we measure the distance as a tree distance between the two profiles within the hierarchy, weighted by the probability scores.</Paragraph> <Paragraph position="2"> (Note that we formulate a distance measure, while referring to a component of semantic similarity. We assume throughout the paper that WordNet distance is the inverse of WordNet similarity, and indeed the similarity measures we use are directly invertible.) We illustrate with an example the differences between our measure and both McCarthy's (2000) method and general vector distance measures. Consider the two selectional profiles in Figure 1, with Profile1 in square boxes, and Profile2 in ovals. To calculate the vector distance between Profile1 and Profile2, we need two vectors of equal dimension. In this example, one can propagate the distributions to the lowest common subsumers (i.e., B, C, and D) as in McCarthy (2000); both profiles are then represented as vectors of scores over the three nodes B, C, and D.</Paragraph> <Paragraph position="4"> Alternately, one can also increase the dimension of each profile to include all nodes in the hierarchy (or just the union of the profile nodes), yielding one vector component per node.</Paragraph> <Paragraph position="6"> In the first method (that of McCarthy, 2000), the two profiles become identical. By generalizing the profiles to the lowest common subsumers, we lose information about the semantic specificity of the profile nodes and can no longer distinguish the semantic distance between the nodes across profiles. In the second method, the information about the hierarchical structure (of WordNet) is lost by treating each profile as a vector of nodes.
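The two vectorization strategies just described can be sketched on a toy hierarchy shaped like the Figure 1 example (B, C, and D under the root A; E and F under B; G under C; H and I under D). All probability values below are assumed for illustration only; they are not the values of the paper's Figure 1.

```python
# Sketch of the two vectorizations of selectional profiles over a toy
# hierarchy. Profile values are invented for illustration.

parent = {"B": "A", "C": "A", "D": "A",
          "E": "B", "F": "B", "G": "C", "H": "D", "I": "D"}

profile1 = {"B": 0.5, "C": 0.1, "H": 0.25, "I": 0.15}   # "square boxes"
profile2 = {"E": 0.3, "F": 0.2, "G": 0.1, "D": 0.4}     # "ovals"

def lift_to(nodes, profile):
    """Propagate all probability mass up to the given set of ancestor nodes."""
    out = {n: 0.0 for n in nodes}
    for node, p in profile.items():
        while node not in nodes:
            node = parent[node]
        out[node] += p
    return out

# Method 1: propagate both profiles to the lowest common subsumers B, C, D.
lcs_nodes = {"B", "C", "D"}
v1 = lift_to(lcs_nodes, profile1)
v2 = lift_to(lcs_nodes, profile2)
print(v1 == v2)   # True: the two profiles become indistinguishable

# Method 2: one vector component per node in the union of the profiles.
all_nodes = sorted(set(profile1) | set(profile2))
u1 = [profile1.get(n, 0.0) for n in all_nodes]
u2 = [profile2.get(n, 0.0) for n in all_nodes]
print(u1 == u2)   # False: vectors differ, but tree structure is lost
```

Method 1 collapses both profiles onto identical vectors; method 2 keeps them distinct but discards the hierarchy, which is exactly the trade-off discussed in the text.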
Hence, vector distance measures fail to capture any semantic similarity across different nodes (e.g., the value of node B in Profile1 is not directly compared to the values of its child nodes E and F in Profile2).</Paragraph> <Paragraph position="7"> To remedy such shortcomings, our goal is to design a new distance measure that (i) compares the distributional differences between two profiles (somewhat similar to existing vector distances), and also (ii) captures the semantic distance between profiles. Intuitively, we can think of the profile distance as how far one profile (source) needs to &quot;travel&quot; to reach the other profile (destination). Formally, we define SPD as: SPD(Profile_src, Profile_dest) = SUM over node pairs (s,d) of propqty(s,d) x distance(s,d) (1)</Paragraph> <Paragraph position="9"> where propqty(s,d) is the portion of the profile score at node s in Profile_src that travels to node d in Profile_dest, and distance(s,d) is the semantic distance between node s and node d in the hierarchy. For now, it can be assumed that propqty(s,d) is score(s), the entire probability score at node s. Note that we design the distance to be symmetric, so that the distance remains the same regardless of which profile is source and which is destination. (We present our distance measures below.)
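Once the propagation plan is fixed, the SPD definition (equation 1) reduces to a weighted sum over node pairs. A minimal sketch, with the (propqty, distance) pairs supplied directly and the numeric values assumed for illustration:

```python
# Minimal sketch of equation 1: SPD as a weighted sum over node pairs.
# Each move is a (propqty, distance) pair: how much probability mass
# travels from a source node to a destination node, and how far apart
# the two nodes are in the hierarchy.

def spd(moves):
    """moves: iterable of (propqty, distance) pairs, one per (s, d)."""
    return sum(propqty * dist for propqty, dist in moves)

# For instance, a step moving score(E) and score(F) up to node B
# contributes score(E)*distance(E,B) + score(F)*distance(F,B);
# with assumed scores 0.3 and 0.2 and unit edge distances:
print(spd([(0.3, 1.0), (0.2, 1.0)]))  # 0.5
```

How the (propqty, distance) pairs are derived from two profiles is the subject of the worked example that follows.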
(Note that Profile1 and Profile2 are both tree cuts, so that we can compare McCarthy's method; keep in mind, however, that our method--as well as traditional vector distances--will apply to any probability distribution over a tree.)</Paragraph> <Paragraph position="10"> In the current example, we can propagate Profile2 (source) to Profile1 (destination) by moving its probabilities in this manner: 1. probabilities at nodes E and F to node B 2. probability at node G to node C 3. probability at node D to nodes H and I The first two steps are straightforward--whenever there is one destination node in a propagation path, we simply multiply the amount moved by the distance of the path (distance(s,d)). For example, step 1 yields a contribution to SPD(Profile_src, Profile_dest) of score(E) x distance(E,B) + score(F) x distance(F,B). However, the last step, step 3, has multiple destination nodes (H and I), and the probability of the source node, D, must be appropriately apportioned between them. We take this into account in the propqty function, by including a weight component: propqty(s,d) = weight(d) x portion(s) (2)</Paragraph> <Paragraph position="12"> where weight(d) is the weight of the destination node d and portion(s) is the portion of score(s) that we are moving. (For this example, we continue to assume that the full amount of score(s) is moved; we discuss portion(s) further below.)
The weight of each destination node d is calculated as the proportion of its score in the sum of the scores of its siblings. Thus, in steps 1 and 2 above, weight(B) and weight(C) are both 1, and the full amounts at E, F, and G are moved up.</Paragraph> <Paragraph position="13"> In the last step, however, the sibling nodes H and I have to split the input from node D: node H has weight score(H) / (score(H) + score(I)), and node I analogously has weight score(I) / (score(H) + score(I)).</Paragraph> <Paragraph position="14"> Hence, the SPD propagating from Profile2 to Profile1 can be calculated as: SPD = score(E) x distance(E,B) + score(F) x distance(F,B) + score(G) x distance(G,C) + weight(H) x score(D) x distance(D,H) + weight(I) x score(D) x distance(D,I)</Paragraph> <Paragraph position="16"> For simplicity, we designed this example such that the two profiles are very similar. As a result, we end up propagating the entire source profile by propagating the full score of each of its nodes.</Paragraph> <Paragraph position="17"> (We have described the algorithm as moving one profile to another. Conceptually, there are cases, as illustrated in the example, where we are propagating profile scores downwards in the hierarchy. Moving scores downwards can be computationally expensive, because one may need to search through the whole subtree rooted at the source node for destination nodes. We therefore implemented an alternative that moves all the scores upwards; since we keep track of the source and destination nodes, the two methods are equivalent.)</Paragraph> <Paragraph position="18"> In practice, for most profile comparisons, we only move the portion of the score at each node necessary to make one profile resemble the other.
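The propagation just worked through can be sketched end to end. The hierarchy shape follows the example, but the probability values and the unit edge distances are assumed for illustration; they are not the values of the paper's Figure 1.

```python
# End-to-end sketch of the worked propagation from Profile2 (source)
# to Profile1 (destination) on a toy hierarchy: B, C, D under root A;
# E, F under B; G under C; H, I under D. Values are invented.

depth = {"A": 0, "B": 1, "C": 1, "D": 1,
         "E": 2, "F": 2, "G": 2, "H": 2, "I": 2}

profile1 = {"B": 0.5, "C": 0.1, "H": 0.25, "I": 0.15}   # destination
profile2 = {"E": 0.3, "F": 0.2, "G": 0.1, "D": 0.4}     # source

def edge_dist(a, b):
    # edge distance between a node and one of its ancestors/descendants
    return abs(depth[a] - depth[b])

# Steps 1 and 2: a single destination per path, so full scores move.
total = 0.0
for src, dst in [("E", "B"), ("F", "B"), ("G", "C")]:
    total += profile2[src] * edge_dist(src, dst)

# Step 3: D's mass is apportioned between siblings H and I by weight(d),
# each destination's share of the sibling scores.
w_H = profile1["H"] / (profile1["H"] + profile1["I"])   # 0.625
w_I = profile1["I"] / (profile1["H"] + profile1["I"])   # 0.375
total += w_H * profile2["D"] * edge_dist("D", "H")
total += w_I * profile2["D"] * edge_dist("D", "I")

print(round(total, 3))  # 1.0
```

Because this toy pair of profiles is, as in the paper's example, very similar, the entire source mass is moved; in general only the portion needed to make one profile resemble the other would travel.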
Hence, portion(s) in the formula for propqty(s,d) in equation 2 captures the difference between the probabilities at node s across the source and destination profiles.</Paragraph> <Paragraph position="19"> So far we have said little about the calculation of semantic distance between profile nodes (i.e., distance(s,d) in equation 1). Recall that one important goal in designing SPD is to capture semantic similarity between WordNet nodes. Naturally, we look to the current research comparing semantic similarity between word senses (e.g., Budanitsky and Hirst, 2001; Lin, 1998). We choose to implement two straightforward methods. For one, we invert (to obtain distance) the WordNet similarity measure of Wu and Palmer (1994), yielding: dist_WP(n1, n2) = (depth(n1) + depth(n2)) / (2 x depth(lcs(n1, n2)))</Paragraph> <Paragraph position="21"> where lcs(n1, n2) is the lowest common subsumer of n1 and n2. The other method we use is the simple edge distance between nodes, dist_edge. Thus far, we have defined SPD as a sum of propagated profile scores multiplied by the distance &quot;travelled&quot; (equation 1). We have also considered propagating other values as a function of profile scores. Let's return to the same example but redistribute some of the probability mass of Profile2: node E goes from a probability of 0.3 to 0.45, and node F goes from 0.2 to 0.05. As a result, the distribution of the scores in the node B subtree is more skewed towards node E than in the original Profile2. For both the original and the modified Profile2, SPD has the same value, because we are moving a total probability mass of 0.5 from E and F to B, with the same semantic distance (since E and F are at the same level in the tree).
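The two node-distance functions can be sketched on the same toy hierarchy. The depth-from-root convention and the helper functions here are assumptions of this sketch, not the paper's implementation.

```python
# Sketch of the two node-distance functions: inverted Wu-Palmer distance
# and simple edge distance, over a toy hierarchy (structure assumed).

parent = {"B": "A", "C": "A", "D": "A",
          "E": "B", "F": "B", "G": "C", "H": "D", "I": "D"}

def ancestors(n):
    path = [n]
    while n in parent:
        n = parent[n]
        path.append(n)
    return path  # the node itself up to the root

def depth(n):
    return len(ancestors(n)) - 1  # root has depth 0

def lcs(n1, n2):
    a2 = set(ancestors(n2))
    return next(n for n in ancestors(n1) if n in a2)  # lowest common subsumer

def dist_wp(n1, n2):
    # inverse of Wu-Palmer similarity; assumes lcs(n1, n2) is not the root
    return (depth(n1) + depth(n2)) / (2.0 * depth(lcs(n1, n2)))

def dist_edge(n1, n2):
    c = lcs(n1, n2)
    return (depth(n1) - depth(c)) + (depth(n2) - depth(c))

print(lcs("E", "F"))        # B
print(dist_edge("E", "G"))  # 4 (path E-B-A-C-G)
print(dist_wp("E", "F"))    # (2+2)/(2*1) = 2.0
```

Both functions invert a similarity notion: nodes that share a deep common subsumer get a small distance, nodes that meet only near the root get a large one.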
However, we consider that, at the node B subtree, Profile1 is less similar to the skewed Profile2 than to the original, more evenly distributed Profile2. To reflect this observation, we can propagate the &quot;inverse entropy&quot; in order to capture how evenly distributed the probabilities are in a subtree. We define an alternative version of propqty: propqty(s,d) = weight(d) x entropy_inv(s)</Paragraph> <Paragraph position="23"> where we replace portion(s) with the inverse entropy, entropy_inv(s), which we define as the reciprocal of the entropy of the normalized probability scores over node s and its sibling source nodes: entropy_inv(s) = 1 / H(s).</Paragraph> <Paragraph position="25"> (In addition to the two distance measures above, we also implemented the WordNet edge distance measure of Leacock and Chodorow (1998). Since it did not influence our results, we omit discussion of it here.)</Paragraph> <Paragraph position="26"> By propagating inverse entropy, we penalize cases where the distribution of source scores is &quot;skewed.&quot; In this work, we will experiment with both methods of propagation (with and without inverse entropy).</Paragraph> </Section> <Section position="5" start_page="5" end_page="5" type="metho"> <SectionTitle> 4 Materials and Methods </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.1 Corpus Data </SectionTitle> <Paragraph position="0"> Our materials are drawn from a 6M-word corpus of medical texts, which we mined for a related project. The texts are medical journal abstracts and articles obtained by querying the PubMed Central search engine (http://www.pubmedcentral.nih.gov/). Query terms were taken from entries listed under the &quot;Medical Encyclopedia&quot; and &quot;Drug Information&quot; sections of the MedlinePlus website (http://www.nlm.nih.gov/medlineplus/).
The text is parsed using the RASP parser (Briscoe and Carroll, 2002), and subcategorizations are extracted using the system of Briscoe and Carroll (1997). The subcategorization frame entry of each verb includes the frequency count and a list of argument heads per slot. The target slots in this work are the subject of the intransitive and the object of the transitive.</Paragraph> </Section> <Section position="2" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.2 Verb Selection </SectionTitle> <Paragraph position="0"> We evaluate our method on the causative alternation to allow comparison with the earlier methods of McCarthy (2000) and Merlo and Stevenson (2001). We selected target verbs by choosing classes (not individual verbs) from Levin (1993) that are expected to undergo the causative alternation. We refer to these as causative verbs. For our first development set, we chose filler (non-alternating) verbs from a small set of classes that are not expected to exhibit the causative alternation. These are the restricted-class verbs. For our second development set, we did not restrict the classes of the fillers, except to avoid classes that allow a subject/object alternation as in the causative.</Paragraph> <Paragraph position="1"> These are the broad-class verbs.</Paragraph> <Paragraph position="2"> (Note that we did not hand-verify that individual verbs allowed or disallowed the alternation, as McCarthy (2000) had done, because we wanted to evaluate our method in the presence of noise of this kind.) Verbs that occur a minimum of 10 times per frame are chosen. We randomly select 36 causative verbs and 36 filler verbs for development, forming two sets of 18 causative and 18 filler verbs. The first development set uses 18 restricted-class filler verbs, and the second uses 18 broad-class filler verbs. We also randomly select 20 causative verbs and 20 broad-class verbs for testing.
(The 20 filler test verbs are all drawn from the same classes as the broad-class development verbs, so that we could directly compare performance between the second development set and the test set.) Each set of verbs is further divided into a high frequency band (with at least 90 instances of one target slot), and a low frequency band (with between 20 and 80 instances of one target slot). These bands have 10 and 8 verbs, respectively, in the development sets, and equal numbers of verbs (10 each) in the test set. For each of the development and testing phases, we experiment with individual frequency bands (i.e., high band and low band, separately), and with mixed frequencies (i.e., all verbs).</Paragraph> </Section> <Section position="3" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.3 Experimental Set-Up </SectionTitle> <Paragraph position="0"> For each verb, we extracted the argument heads of the target slots from the corpus. Using (verb,slot,noun) frequencies, we experimented with several ways of building selectional profiles of each verb's argument slot (Resnik, 1993; Li and Abe, 1998; Clark and Weir, 2002). In our development work, we found that the method of Clark and Weir (2002) overall gave better performance, and so we limit our discussion here to the results on their model.</Paragraph> <Paragraph position="1"> It is worth noting that the method of Clark and Weir (2002) does not yield a tree cut, but instead generally populates the WordNet hierarchy with non-zero probabilities.</Paragraph> <Paragraph position="2"> This means that the kind of straightforward propagation method used by McCarthy (2000) is not applicable to selectional profiles of this type.</Paragraph> <Paragraph position="3"> We compare SPD to a number of other measures, applied directly to the (unpropagated) probability profiles given by the Clark-Weir method: the probability distribution distances given by skew divergence (skew) and Jensen-Shannon divergence (JS) (Lee,
2001), as well as the general vector distances of cosine (cos), Manhattan distance (L1 norm), and Euclidean distance (L2 norm).</Paragraph> <Paragraph position="4"> To determine whether a verb participates in the causative alternation, we adopt McCarthy's method of using a threshold over the calculated distance measures, testing both the mean and median distances as possible thresholds. In our case, verbs with slot-distances below the threshold (smaller distances) are classified as causative, and those above the threshold as non-causative.</Paragraph> <Paragraph position="5"> Accuracy is used as the performance measure.</Paragraph> </Section> </Section> </Paper>