<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0410"> <Title>Semi-supervised Verb Class Discovery Using Noisy Features</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Feature Space </SectionTitle> <Paragraph position="0"> Like others, we have assumed lexical semantic classes of verbs as defined in Levin (1993) (hereafter Levin), which have served as a gold standard in computational linguistics research (Dorr and Jones, 1996; Kipper et al., 2000; Merlo and Stevenson, 2001; Schulte im Walde and Brew, 2002). Levin's classes form a hierarchy of verb groupings with shared meaning and syntax. Our feature space was designed to reflect these classes by capturing properties of the semantic arguments of verbs and their mapping to syntactic positions. It is important to emphasize, however, that our features are extracted from part-of-speech (POS) tagged and chunked text only: there are no semantic tags of any kind. Thus, the features serve as approximations to the underlying distinctions among classes.</Paragraph> <Paragraph position="1"> Here we briefly describe the features that comprise our feature space, and refer the interested reader to Joanis and Stevenson (2003) for details.</Paragraph> <Paragraph position="2"> Features over Syntactic Slots (120 features) One set of features encodes the frequency of the syntactic slots occurring with a verb (subject, direct and indirect object, and prepositional phrases (PPs) indexed by preposition), which collectively serve as rough approximations to the allowable syntactic frames for a verb. We also count fixed elements in certain slots (it and there, as in It rains or There appeared a ship), since these are part of the syntactic frame specifications for a verb.</Paragraph> <Paragraph position="3"> In addition to approximating the syntactic frames themselves, we also want to capture regularities in the mapping of arguments to particular slots. For example, the location argument, the truck, is direct object in I loaded the truck with hay, and object of a preposition in I loaded hay onto the truck. These allowable alternations in the expressions of arguments vary according to the class of a verb. We measure this behaviour using features that encode the degree to which two slots contain the same entities--that is, we calculate the overlap in noun (lemma) usage between pairs of syntactic slots.</Paragraph> <Paragraph position="4"> Tense, Voice, and Aspect Features (24 features) Verb meaning, and therefore class membership, interacts in interesting ways with voice, tense, and aspect (Levin, 1993; Merlo and Stevenson, 2001). In addition to verb POS (which often indicates tense) and voice (passive/active), we also include counts of modals, auxiliaries, and adverbs, which are partial indicators of these factors.</Paragraph> <Paragraph position="5"> The Animacy Features (76 features) Semantic properties of the arguments that fill certain roles, such as animacy or motion, are more challenging to detect automatically. Currently, our only such feature is an extension of the animacy feature of Merlo and Stevenson (2001). 
We approximate the animacy of each of the 76 syntactic slots by counting both pronouns and proper noun phrases (NPs) labelled as &quot;person&quot; by our chunker (Abney, 1991).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experimental Classes and Verbs </SectionTitle> <Paragraph position="0"> We use the same classes and example verbs as in the supervised experiments of Joanis and Stevenson (2003) to enable a comparison between the performance of the unsupervised and supervised methods. Here we describe the selection of the experimental classes and verbs, and the estimation of the feature values.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The Verb Classes </SectionTitle> <Paragraph position="0"> Pairs or triples of verb classes from Levin were selected to form the test pairs/triples for each of a number of separate classification tasks. These sets exhibit different contrasts between verb classes in terms of their semantic argument assignments, allowing us to evaluate our approach under a range of conditions. For example, some classes differ in both their semantic roles and frames, while others have the same roles in different frames, or different roles in the same frames.1 Here we summarize the argument structure distinctions between the classes; Table 1 below lists the classes with their Levin class numbers.</Paragraph> <Paragraph position="1"> Benefactive versus Recipient verbs.</Paragraph> <Paragraph position="2"> Mary baked... a cake for Joan/Joan a cake.</Paragraph> <Paragraph position="3"> Mary gave... a cake to Joan/Joan a cake.</Paragraph> <Paragraph position="4"> These dative alternation verbs differ in the preposition and the semantic role of its object.</Paragraph> <Paragraph position="5"> [Footnote 1, fragment: ... general conclusions from the results, the classes also could neither be too small nor contain mostly infrequent verbs.] Admire versus Amuse verbs.</Paragraph> <Paragraph position="6"> I admire Jane. Jane amuses me.</Paragraph> <Paragraph position="7"> These psychological state verbs differ in that the Experiencer argument is the subject of Admire verbs, and the object of Amuse verbs.</Paragraph> <Paragraph position="8"> Run versus Sound Emission verbs.</Paragraph> <Paragraph position="9"> The kids ran in the room./*The room ran with kids.</Paragraph> <Paragraph position="10"> The birds sang in the trees./The trees sang with birds. These activity verbs both have an Agent subject in the intransitive, but differ in the prepositional alternations they allow.</Paragraph> <Paragraph position="11"> Cheat versus Steal and Remove verbs.</Paragraph> <Paragraph position="12"> I cheated... Jane of her money/*the money from Jane. I stole... *Jane of her money/the money from Jane.</Paragraph> <Paragraph position="13"> These classes also assign the same semantic arguments, but differ in their prepositional alternants.</Paragraph> <Paragraph position="14"> Wipe versus Steal and Remove verbs.</Paragraph> <Paragraph position="15"> Wipe... the dust/the dust from the table/the table.</Paragraph> <Paragraph position="16"> Steal... the money/the money from the bank/*the bank. These classes generally allow the same syntactic frames, but differ in the possible semantic role assignment. (Location can be the direct object of Wipe verbs but not of Steal and Remove verbs, as shown.)
Spray/Load versus Fill versus Other Verbs of Putting (several related Levin classes).</Paragraph> <Paragraph position="17"> I loaded... hay on the wagon/the wagon with hay.</Paragraph> <Paragraph position="18"> I filled... *hay on the wagon/the wagon with hay.</Paragraph> <Paragraph position="19"> I put... hay on the wagon/*the wagon with hay.</Paragraph> <Paragraph position="20"> These three classes also assign the same semantic roles but differ in prepositional alternants. Note, however, that the options for Spray/Load verbs overlap with those of the other two types of verbs.</Paragraph> <Paragraph position="21"> Optionally Intransitive: Run versus Change of State versus &quot;Object Drop&quot;.</Paragraph> <Paragraph position="22"> The horse raced./The jockey raced the horse.</Paragraph> <Paragraph position="23"> The butter melted./The cook melted the butter.</Paragraph> <Paragraph position="24"> The boy played./The boy played soccer.</Paragraph> <Paragraph position="25"> These three classes are all optionally intransitive but assign different semantic roles to their arguments (Merlo and Stevenson, 2001). (Note that the Object Drop verbs are a superset of the Benefactives above.) For many tasks, knowing exactly what PP arguments each verb takes may be sufficient to perform the classification (cf. Dorr and Jones, 1996). However, our features do not give us such perfect knowledge, since PP arguments and adjuncts cannot be distinguished with high accuracy. Using our simple extraction tools, for example, the PP_for argument in I admired Jane for her honesty is not distinguished from the PP_for adjunct in I amused Jane for the money. Furthermore, PP arguments differ in frequency, so that a highly distinguishing but rarely used alternant will likely not be useful. Indicators of PP usage are thus useful but not definitive.</Paragraph> <Paragraph position="26"> [Table 1 caption, fragment: ... numbers, and the number of experimental verbs in each (see Section 3.2).]</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Verb Selection </SectionTitle> <Paragraph position="0"> Our experimental verbs were selected as follows. We started with a list of all the verbs in the given classes from Levin, removing any verb that did not occur at least 100 times in our corpus (the BNC, described below). Because we make the simplifying assumption of a single correct classification for each verb, we also removed any verb: that was deemed excessively polysemous; that belonged to another class under consideration in our study; or for which the class did not correspond to the main sense.</Paragraph> <Paragraph position="1"> Table 1 above shows the number of verbs in each class at the end of this process. Of these verbs, 20 from each class were randomly selected to use as training data for our supervised experiments in Joanis and Stevenson (2003).</Paragraph> <Paragraph position="2"> We began with this same set of 20 verbs per class for our current work. We then replaced 10 of the 260 verbs (4%) to enable us to have representative seed verbs for certain classes in our semi-supervised experiments (e.g., so that we could include wipe as a seed verb for the Wipe verbs, and fill for the Fill verbs). All experiments reported here were run on this same final set of 20 verbs per class (including a replication of our earlier supervised experiments).</Paragraph>
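To make the selection procedure just described concrete, the sketch below shows, in Python, the two checks that can be automated: the corpus-frequency threshold of 100 and the exclusion of verbs listed under more than one target class. The data structures (a class-to-verbs mapping and a precomputed BNC frequency dictionary) and the function name are assumptions for illustration only; the manual judgments about polysemy and main sense described above cannot be captured this way.

```python
from collections import Counter

def select_candidate_verbs(class_to_verbs, corpus_freq, min_count=100):
    """Filter Levin candidate verbs by frequency and cross-class ambiguity.

    class_to_verbs: dict mapping a class name to its candidate verbs (hypothetical input)
    corpus_freq:    dict mapping a verb lemma to its BNC occurrence count (assumed precomputed)
    min_count:      minimum corpus frequency (100 in the text)
    """
    # A verb listed under more than one class in the study is excluded as ambiguous.
    membership = Counter(v for verbs in class_to_verbs.values() for v in verbs)
    return {
        cls: [v for v in verbs
              if corpus_freq.get(v, 0) >= min_count  # frequency filter
              and membership[v] == 1]                # not in another class under study
        for cls, verbs in class_to_verbs.items()
    }

# Hypothetical toy usage:
# classes = {"Benefactive": ["bake", "carve"], "Recipient": ["give", "sell"]}
# freqs = {"bake": 2500, "carve": 800, "give": 120000, "sell": 40000}
# print(select_candidate_verbs(classes, freqs))
```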
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Feature Extraction </SectionTitle> <Paragraph position="0"> All features were estimated from counts over the British National Corpus (BNC), a 100M word corpus of text samples of recent British English ranging over a wide spectrum of domains. Since it is a general corpus, we do not expect any strong overall domain bias in verb usage.</Paragraph> <Paragraph position="1"> We used the chunker (partial parser) of Abney (1991) to preprocess the corpus, which (noisily) determines the NP subject and direct object of a verb, as well as the PPs potentially associated with it. Indirect objects are identified by a less sophisticated (and even noisier) method, simply assuming that two consecutive NPs after the verb constitute a double object frame. From these extracted slots, we calculate the features described in Section 2, yielding a vector of 220 normalized counts for each verb, which forms the input to our machine learning experiments.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Clustering and Evaluation Methods </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Clustering Parameters </SectionTitle> <Paragraph position="0"> We used the hierarchical clustering command in Matlab, which implements bottom-up agglomerative clustering, for all our unsupervised experiments. In performing hierarchical clustering, both a vector distance measure and a cluster distance (&quot;linkage&quot;) measure are specified. We used the simple Euclidean distance for the former, and Ward linkage for the latter. Ward linkage essentially minimizes the distances of all cluster points to the centroid, and thus is less sensitive to outliers than some other methods.</Paragraph> <Paragraph position="1"> We chose hierarchical clustering because it may be possible to find coherent subclusters of verbs even when there are not exactly n good clusters, where n is the number of classes. To explore this, we can induce any number of clusters m by making a cut at a particular level in the clustering hierarchy. In the experiments here, however, we report only results for m = n, since we found no principled way of automatically determining a good cutoff. However, we did experiment with m = 2n (as in Strehl et al., 2000), and found that performance was generally better (even on our R_adj measure, described below, that discounts oversplitting). This supports our intuition that the approach may enable us to find more consistent clusters at a finer grain, without too much fragmentation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluation Measures </SectionTitle> <Paragraph position="0"> We use three separate evaluation measures that tap into very different properties of the clusterings.</Paragraph> <Paragraph position="1"> We can assign each cluster the class label of the majority of its members. Then for all verbs v, consider v to be classified correctly if Class(v) = ClusterLabel(v), where Class(v) is the actual class of v and ClusterLabel(v) is the label assigned to the cluster in which v is placed.
Then accuracy (Acc) has the standard definition: Acc = #verbs correctly classified / #verbs total. [Footnote 2: Acc is equivalent to the weighted mean precision of the clusters, weighted according to cluster size.]</Paragraph> <Paragraph position="2"> As we have defined it, Acc generally increases as the number of clusters increases, with the extreme being at the number of clusters equal to the number of verbs. However, since we fix our number of clusters to the number of classes, the measure remains informative.</Paragraph> <Paragraph position="4"> Acc thus provides a measure of the usefulness in practice of a clustering--that is, if one were to use the clustering as a classification, this measure tells how accurate overall the class assignments would be. The theoretical maximum is, of course, 1. To calculate a random baseline, we evaluated 10,000 random clusterings with the same number of verbs and classes as in each of our experimental tasks. Because the Acc achieved depends on the precise size of clusters, we calculated mean Acc over the best scenario (with equal-sized clusters), yielding a conservative estimate (i.e., an upper bound) of the baseline. These figures are reported with our results in Table 2 below.</Paragraph> <Paragraph position="5"> Accuracy can be relatively high for a clustering when a few clusters are very good, and others are not good.</Paragraph> <Paragraph position="6"> Our second measure, the adjusted Rand measure (R_adj) used by Schulte im Walde (2003), instead gives a measure of how consistent the given clustering is overall with respect to the gold standard classification. The formula is the standard adjusted Rand index: R_adj = [sum_ij C(n_ij, 2) - E] / [(1/2)(sum_i C(n_i., 2) + sum_j C(n_.j, 2)) - E], where E = sum_i C(n_i., 2) * sum_j C(n_.j, 2) / C(N, 2); here C(x, 2) denotes the number of pairs that can be chosen from x items, N is the total number of verbs, n_ij is the entry in the contingency table between the classification and the clustering, counting the size of the intersection of class i and cluster j, and n_i. and n_.j are its row and column sums. Intuitively, R_adj measures the similarity of two partitions of data by considering agreements and disagreements between them--there is agreement, for example, if v and v' from the same class are in the same cluster, and disagreement if they are not. It is scaled so that perfect agreement yields a value of 1, whereas random groupings (with the same number of groups in each) get a value around 0. It is therefore considered &quot;corrected for chance,&quot; given a fixed number of clusters.3 In tests of the R_adj measure on some contrived clusterings, we found it quite conservative, and on our experimental clusterings it did not often attain values higher than .25. However, it is useful as a relative measure of goodness, in comparing clusterings arising from different feature sets.</Paragraph> <Paragraph position="7"> Acc gives an average of the individual goodness of the clusters, and R_adj a measure of the overall goodness, both with respect to the gold standard classes. Our final measure gives an indication of the overall goodness of the clusters themselves, without regard to the target classes. [Figure 1 caption: Dendrograms for the 2-way Wipe/Steal-Remove task, using the Ling and Seed sets. The higher Sil (.89 vs. .33) reflects the better separation of the data.]</Paragraph> <Paragraph position="8"> We use Sil, the mean of the silhouette measure from Matlab, which measures how distant a data point is from other clusters.
Silhouette values vary from +1 to -1, with +1 indicating that the point is near the centroid of its own cluster, and -1 indicating that the point is very close to another cluster (and therefore likely in the wrong cluster). A value of 0 suggests that a point is not clearly in a particular cluster.</Paragraph> <Paragraph position="9"> We calculate the mean silhouette of all points in a clustering to obtain an overall measure of how well the clusters are separated. Essentially, the measure numerically captures what we can intuitively grasp in the visual differences between the dendrograms of &quot;better&quot; and &quot;worse&quot; clusterings. (A dendrogram is a tree diagram whose leaves are the data points, and whose branch lengths indicate similarity of subclusters; roughly, shorter vertical lines indicate closer clusters.) For example, Figure 1 shows two dendrograms using different feature sets (Ling and Seed, described in Section 5) for the same set of verbs from two classes. The Seed set has slightly lower values for Acc and R_adj, but a much higher value (.89) for Sil, indicating a better separation of the data. This captures what is reflected in the dendrogram, in that very short lines connect verbs low in the tree, and longer lines connect the two main clusters. The Sil measure is independent of the true classification, and could be high when the other, classification-dependent measures are low, or vice versa. However, it gives important information about the quality of a clustering: The other measures being equal, a clustering with a higher Sil value indicates tighter and more separated clusters, suggesting stronger inherent patterns in the data.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"> We report here the results of a number of clustering experiments, using feature sets as follows: (1) the full feature space; (2) a manually selected subset of features; (3) unsupervised selection of features; and (4) semi-supervised selection, using a supervised learner applied to seed verbs to select the features.</Paragraph> <Paragraph position="1"> For each type of feature set, we performed the same ten clustering tasks, shown in the first column of Table 2. These are the same tasks performed in the supervised setting of Joanis and Stevenson (2003). The 2- and 3-way tasks, and their motivation, were described in Section 3.1.</Paragraph> <Paragraph position="2"> Three multiway tasks explore performance over a larger number of classes: The 6-way task involves the Cheat, Steal-Remove, Wipe, Spray/Load, Fill, and &quot;Other Verbs of Putting&quot; classes, all of which undergo similar locative alternations. To these 6, the 8-way task adds the Run and Sound Emission verbs, which also undergo locative alternations. The 13-way task includes all of our classes.</Paragraph> <Paragraph position="3"> The second column of Table 2 includes the accuracy of our supervised learner (the decision tree induction system, C5.0) on the same verb sets as in our clustering experiments.
These are the results of a 10-fold cross-validation (with boosting) repeated 50 times.4 In our earlier work, we found that cross-validation performance averaged about .02, .04, and .11 higher than test performance on the 2-way, 3-way, and multiway tasks, respectively, and so should be taken as an upper bound on what can be achieved.</Paragraph> <Paragraph position="4"> The third column of Table 2 gives the baseline Acc we calculated from random clusterings. Recall that this is an upper bound on random performance. We use this baseline in calculating reductions in error rate of Acc. The remaining columns of the table give the Acc, R_adj, and Sil measures as described in Section 4.2, for each of the feature sets we explored in clustering, which we discuss in turn below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Full Feature Set </SectionTitle> <Paragraph position="0"> The first subcolumn (Full) under each of the three clustering evaluation measures in Table 2 shows the results using the full set of features (i.e., no feature selection). Although generally higher than the baseline, Acc is well below that of the supervised learner, and R_adj and Sil are generally low.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Manual Feature Selection </SectionTitle> <Paragraph position="0"> One approach to dimensionality reduction is to hand-select features that one believes to be relevant to a given task. Following Joanis and Stevenson (2003), for each class, we systematically identified the subset of features indicated by the class description given in Levin. For each task, then, the linguistically-relevant subset is defined as the union of these subsets for all the classes in the task. [Table 2 caption, fragment: ... feature set; Ling is manually selected subset; Seed is seed-verb-selected set. See text for further description.]</Paragraph> <Paragraph position="1"> The results for these feature sets in clustering are given in the second subcolumn (Ling) under each of the three measures in Table 2. On the 2-way tasks, the performance on average is very close to that of the full feature set for the Acc and R_adj measures. On the 3-way and multiway tasks, there is a larger performance gain using the subset of features, with an increase in the reduction of the error rate (over Base Acc) of 6-7% over the full feature set.</Paragraph> <Paragraph position="2"> Overall, there is a small performance gain using the Ling subset of features (with an increase in error rate reduction from 13% to 17%). Moreover, the Sil value for the manually selected features is almost always very much higher than that of the full feature set, indicating that the subset of features is more focused on the properties that lead to a better separation of the data.</Paragraph> <Paragraph position="3"> This performance comparison tentatively suggests that good feature selection can be helpful in our task. However, it is important to find a method that does not depend on having an existing classification, since we are interested in applying the approach when such a classification does not exist.
In the next two sections, we present unsupervised and minimally supervised approaches to this problem.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Unsupervised Feature Selection </SectionTitle> <Paragraph position="0"> In order to deal with excessive dimensionality, Dash et al. (1997) propose an unsupervised method to rank a set of features according to their ability to organize the data in space, based on an entropy measure they devise. Unfortunately, this promising method did not prove practical for our data. We performed a number of experiments in which we tested the performance of each feature set from cardinality 1 to the total number of features, where each set of size k differs from the set of size k-1 in the addition of the feature with next highest rank (according to the proposed entropy measure). Many feature sets performed very well, and some far outperformed our best results using other feature selection methods. However, across our 10 experimental tasks, there was no consistent range of feature ranks or feature set sizes that was correlated with good performance. While we could have selected a threshold that might work reasonably well with our data, we would have little confidence that it would work well in general, considering the inconsistent pattern of results.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Semi-Supervised Feature Selection </SectionTitle> <Paragraph position="0"> Unsupervised methods such as Dash et al.'s (1997) are appealing because they require no knowledge external to the data. However, in many aspects of computational linguistics, it has been found that a small amount of labelled data contains sufficient information to allow us to go beyond the limits of completely unsupervised approaches. In our domain in particular, verb class discovery &quot;in a vacuum&quot; is not necessary. A plausible scenario is that researchers would have examples of verbs which they believe fall into different classes of interest, and they want to separate other verbs along the same lines. To model this kind of approach, we selected a sample of five seed verbs from each class. Each set of verbs was judged (by the authors' intuition alone) to be &quot;representative&quot; of the class. We purposely did not carry out any linguistic analysis, although we did check that each verb was reasonably frequent (with log frequencies ranging from 2.6 to 5.1).</Paragraph> <Paragraph position="1"> For each experimental task, we ran our supervised learner (C5.0) on the seed verbs for the classes in the task, and extracted from the resulting decision trees the union of all features used, which formed the reduced feature set for that task.</Paragraph> <Paragraph position="2"> Each clustering experiment used the full set of 20 verbs per class; i.e., seed verbs were included, following our proposed model of guided verb class discovery.5 The results using these feature sets are shown in the third subcolumn (Seed) under our three evaluation measures in Table 2. This feature selection method is highly successful, outperforming the full feature set (Full) on Acc and R_adj on most tasks, and performing the same or very close on the remainder. Moreover, the seed set of features outperforms the manually selected set (Ling) on over half the tasks.
More importantly, the Seed set shows a mean overall reduction in error rate (over Base Acc) of 28%, compared to 17% for the Ling set. The increased reduction in error rate is particularly striking for the 2-way tasks, of 37% for the Seed set compared to 20% for the Ling set.</Paragraph> <Paragraph position="3"> Another striking result is the difference in Sil values, which are very much higher than those for Ling (which are in turn much higher than for Full). Thus, not only do we see a sizeable increase in performance, we also obtain tighter and better separated clusters with our proposed feature selection approach.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.5 Further Discussion </SectionTitle> <Paragraph position="0"> In our clustering experiments, we find that smaller subsets of features generally perform better than the full set of features. (See Table 3 for the number of features in the Ling and Seed sets.) However, not just any small set of features is adequate. We ran 50 experiments using randomly selected sets of features whose cardinality was a simple linear function of the number of classes, chosen to roughly approximate the number of features in the Seed sets.</Paragraph> <Paragraph position="1"> [Footnote 5: We also tried directly applying the mutual information (MI) measure used in decision-tree induction (Quinlan, 1986). We calculated the MI of each feature with respect to the classification of the seed verbs, and computed clusterings using the features above a certain MI threshold. This method did not work as well as running C5.0, which presumably captures important feature interactions that are ignored in the individual MI calculations.]</Paragraph> <Paragraph position="2"> Mean Acc over these clusterings was much lower than for the Seed sets, and R_adj was extremely low (below .08 in all cases). Interestingly, Sil was generally very high, indicating that there is structure in the data, but not what matches our classification. This confirms that appropriate feature selection, and not just a small number of features, is important for the task of verb class discovery.</Paragraph> <Paragraph position="3"> We also find that our semi-supervised method (Seed) is linguistically plausible, and performs as well as or better than features manually determined based on linguistic knowledge (Ling). We might also ask, would any subset of verbs do as well? To answer this, we ran experiments using 50 different randomly selected seed verb sets for each class. We found that the mean Acc and Sil values are about the same as those of the Seed set reported above, but mean R_adj is a little lower. We tentatively conclude that, yes, any subset of verbs of the appropriate class may be sufficient as a seed set, although some sets are better than others. This is promising for our method, as it shows that the precise selection of a seed set of verbs is not crucial to the success of the semi-supervised approach.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Other Verb Clustering Work </SectionTitle> <Paragraph position="0"> Using the same Acc measure as ours, Stevenson and Merlo (1999) achieved performance in clustering very close to that of their supervised classification.
However, their study used a small set of five features manually devised for a set of three particular classes. Our feature set is essentially a generalization of theirs, but in scaling up the feature space to be useful across English verb classes in general, we necessarily face a dimensionality problem that did not arise in their research.</Paragraph> <Paragraph position="1"> Schulte im Walde and Brew (2002) and Schulte im Walde (2003), on the other hand, use a larger set of features intended to be useful for a broad number of classes, as in our work. The R_adj scores of Schulte im Walde (2003) range from .09 to .18, while ours range from .02 to .34, with a mean of .17 across all tasks. However, Schulte im Walde's features rely on accurate subcategorization statistics, and her experiments include a much larger set of classes (around 40), each with a much smaller number of verbs (average around 4). Performance differences may be due to the types of features (ours are noisier, but capture information beyond subcat), or due to the number or size of classes. While our R_adj results generally decrease with an increase in the number of classes, indicating that our tasks in general may be &quot;easier&quot; than her 40-way distinction, our classes also have many more members (20 versus an average of 4) that need to be grouped together. It is a question for future research to explore the effect of these variables on clustering performance.</Paragraph> </Section> </Paper>
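The following sketch pulls together the clustering and evaluation set-up of Section 4 with the seed-based feature selection of Section 5.4. It is a rough re-implementation in Python, with SciPy's Ward-linkage agglomerative clustering standing in for the Matlab routine and a single scikit-learn decision tree standing in for boosted C5.0, so results would not match the paper's; the feature matrix X (verbs by 220 normalized counts), the gold class array y, and the seed-verb indices are assumed inputs.

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import adjusted_rand_score, silhouette_score

def seed_selected_features(X, y, seed_idx):
    """Semi-supervised feature selection (Section 5.4, sketched): fit a tree
    on the seed verbs only and keep the union of features it actually uses
    (the paper uses C5.0 with boosting; a single sklearn tree is a stand-in)."""
    X, y = np.asarray(X), np.asarray(y)
    tree = DecisionTreeClassifier(random_state=0).fit(X[seed_idx], y[seed_idx])
    used = np.where(tree.feature_importances_ > 0)[0]
    return used if used.size > 0 else np.arange(X.shape[1])

def cluster_ward(X, n_clusters):
    """Bottom-up agglomerative clustering with Euclidean distance and Ward
    linkage, cut to yield n_clusters clusters (Section 4.1)."""
    Z = linkage(X, method="ward")  # Ward linkage over Euclidean distances
    return fcluster(Z, t=n_clusters, criterion="maxclust")

def majority_label_accuracy(classes, clusters):
    """Acc: label each cluster with the majority class of its members, then
    count a verb as correct if its class matches its cluster's label (4.2)."""
    correct = 0
    for c in set(clusters):
        members = [cls for cls, cl in zip(classes, clusters) if cl == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(classes)

def evaluate_task(X, y, seed_idx, n_classes):
    """Select features from the seed verbs, cluster all verbs, and report
    the three measures of Section 4.2."""
    X, y = np.asarray(X), np.asarray(y)
    feats = seed_selected_features(X, y, seed_idx)
    Xr = X[:, feats]
    clusters = cluster_ward(Xr, n_classes)
    return {
        "Acc": majority_label_accuracy(y, clusters),
        "R_adj": adjusted_rand_score(y, clusters),
        "Sil": silhouette_score(Xr, clusters, metric="euclidean"),
    }

# Hypothetical usage: X is a (num_verbs x 220) array of normalized feature
# counts, y holds each verb's gold class, and seed_idx lists the indices of
# the five seed verbs chosen per class.
# print(evaluate_task(X, y, seed_idx, n_classes=2))
```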