<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1016">
  <Title>Spectral Clustering for German Verbs</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Verb valency description
</SectionTitle>
    <Paragraph position="0"> The data in question come from a subcategorization lexicon induced from a large German newspaper corpus (Schulte im Walde, 2002). The verb valency information is provided in the form of probability distributions over verb frames for each verb. There are two conditions: the first with 38 relatively coarse syntactic verb subcategorisation frames, the second a more delicate classification subdividing the verb frames of the first condition using prepositional phrase information (case plus preposition), resulting in 171 possible frame types.</Paragraph>
    <Paragraph position="1"> The verb frame types contain at most three arguments. Possible arguments in the frames are nominative (n), dative (d) and accusative (a) noun phrases, reflexive pronouns (r), prepositional phrases (p), expletive es (x), non-finite clauses (i), finite clauses (s-2 for verb second clauses, s-dass for dass-clauses, s-ob for ob-clauses, s-w for indirect wh-questions), and copula constructions (k).</Paragraph>
    <Paragraph position="2"> For example, subcategorising a direct (accusative case) object and a non-finite clause would be represented by nai. Table 1 shows an example distribution for the verb glauben 'to think/believe'. The more delicate version of the subcategorisation frames was obtained by distributing the frequency mass of prepositional phrase frame types (np, nap, ndp, npr, xp) over the prepositional phrases, according to their frequencies in the corpus. Prepositional phrases are referred to by case and preposition, such as 'Dat.mit', 'Akk.für'. The present work uses the latter, more delicate, verb valency descriptions.</Paragraph>
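For concreteness, here is a minimal sketch of how a verb's valency description can be held as a probability distribution over frame types. The frame labels follow the notation above; the numeric values are invented for illustration and are not the actual Table 1 figures for glauben.

```python
# Frame labels follow the notation above (n, na, nai, ns-dass, ...); the
# probabilities are invented for illustration, NOT the actual Table 1 values.
glauben = {
    "ns-dass": 0.35,   # nominative NP + dass-clause
    "ns-2":    0.30,   # nominative NP + verb-second clause
    "n":       0.20,   # bare nominative NP
    "na":      0.10,   # nominative + accusative NP
    "ni":      0.05,   # nominative NP + non-finite clause
}

# In the more delicate condition, PP frames are split by case and preposition,
# e.g. a hypothetical refinement of 'np': "np:Dat.mit", "np:Akk.für", ...

assert abs(sum(glauben.values()) - 1.0) < 1e-9   # a probability distribution
```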
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Problems with standard clustering
</SectionTitle>
    <Paragraph position="0"> Our previous work on the valency data applied k-Means (a standard technique) to the task of inducing semantic classes for German verbs (Schulte im Walde and Brew, 2002). We compared the results of k-Means clustering with a gold standard set prepared according to the principles of verb classification advocated by (Levin, 1993), and reported on the sensitivity of the classes to linguistically motivated "lesioning" of the input verb frame. The verb classes we used are listed in Table 2.</Paragraph>
    <Paragraph position="1"> The verb classes are closely related to Levin's English classes. They also agree with the German verb classification in (Schumacher, 1986) as far as the relevant verbs appear in his less extensive semantic 'fields'. The rough glosses and the references to Levin's classes in the table are primarily to aid the intuition of non-native speakers of German.</Paragraph>
    <Paragraph position="2"> Clustering can be thought of as a process that finds a discrete approximation to a distance measure. For any data set of n items over which a distance measure is defined, the Gram matrix is the symmetric n-by-n matrix whose elements Mij are the distances between items i and j.</Paragraph>
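As a sketch of the Gram matrix just described: the code below assumes each of the n items is a numeric vector (e.g. a verb's frame distribution) and that some distance function has been chosen; cosine distance is used here only as an example, not as the paper's sole measure.

```python
import numpy as np

def cosine_distance(p, q):
    """One possible distance: 1 minus the cosine of the angle between p and q."""
    return 1.0 - float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def gram_matrix(items, metric=cosine_distance):
    """The symmetric n-by-n matrix M with M[i, j] the distance between items
    i and j; the diagonal stays 0, as noted in the text."""
    n = len(items)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = metric(items[i], items[j])
    return M
```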
    <Paragraph position="3"> The diagonal elements Mii of this matrix will all be 0. Every clustering corresponds to a block-diagonal Gram matrix. Clustering n items into k classes corresponds to the choice of an ordering for the labels on the axes of the Gram matrix and the choice of k-1 change points marking the boundaries between the blocks. Thus, the search space of clusters is very large. The available techniques for searching this large space do not (and probably cannot) offer guarantees of global optimality. Standard nostrums include transformations of the underlying data, and the deployment of different strategies for initializing the cluster centers. These may produce intuitively attractive clusters, but when we apply these ideas to our verb frame data many questions remain, including the following: When the solutions found by clustering differ from our intuitions, is this because of failures in the features used, the clustering techniques, or the intuitions? How close are the local optima found by the clustering techniques to the best solutions in the space defined by the data? Is it even appropriate to use frequency information for this problem? Or would it suffice to characterize verb classes by the pattern of frames that their members can inhabit, without regard to frequency? Does the data support some clusters more strongly than others? Are all the distinctions made in classifications such as Levin's of equal validity? In response to these questions, the present paper describes an application to the verb data of a particular spectral clustering technique (Ng et al., 2002). At the heart of this approach is a transformation of the original verb frame data into a set of orthogonal eigenvectors. We work in the space defined by the first few eigenvectors, using standard clustering techniques in the transformed space.</Paragraph>
    <Paragraph position="4"> Table 2: Verb classes used as the gold standard (Levin class in parentheses), with rough English glosses.
Aspect (55.1): anfangen, aufhören, beenden, beginnen, enden 'start, stop, bring to an end, begin, end'
Propositional Attitude (29.5): ahnen, denken, glauben, vermuten, wissen 'sense, think, think, guess, know'
Transfer of Possession (Obtaining) (29.5): bekommen, erhalten, erlangen, kriegen 'receive, obtain, acquire, get'
Transfer of Possession (Supply) (11.1/3): bringen, liefern, schicken, vermitteln, zustellen 'bring, deliver, send, procure, deliver'
Manner of Motion (51.4.2): fahren, fliegen, rudern, segeln 'drive, fly, row, sail'
Emotion (31.1): ärgern, freuen 'irritate, delight'
Announcement (37.7): ankündigen, bekanntgeben, eröffnen, verkünden 'announce, make known, disclose, proclaim'
Description (29.2): beschreiben, charakterisieren, darstellen, interpretieren 'describe, characterise, present, interpret'
Insistence (-): beharren, bestehen, insistieren, pochen 'all mean insist'
Position (50): liegen, sitzen, stehen 'lie, sit, stand'
Support (-): dienen, folgen, helfen, unterstützen 'serve, follow, help, support'
Opening (45.4): öffnen, schließen 'open, close'
Consumption (39.4): essen, konsumieren, lesen, saufen, trinken 'eat, consume, read, drink (esp. animals or drunkards), drink'
Weather (57): blitzen, donnern, dämmern, nieseln, regnen, schneien 'flash, thunder, dawn/grow dusky, drizzle, rain, snow'</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="2" type="metho">
    <SectionTitle>
4 The spectral clustering algorithm
</SectionTitle>
    <Paragraph position="0"> The spectral clustering algorithm takes as input a matrix formed from a pairwise similarity function over a set of data points. In image segmentation two pixels might be declared similar if they have similar intensity, similar hue or similar location, or if a local edge-detection algorithm does not place an edge between them.</Paragraph>
    <Paragraph position="1"> The technique is generic, and as (Longuet-Higgins and Scott, 1990) point out, originated not in computer science or AI but in molecular physics. Most of the authors nevertheless "adopt the terminology of image segmentation (i.e. the data points are pixels and the set of pixels is the image), keeping in mind that all the results are also valid for similarity-based clustering" (Meilă and Shi, 2001). Our natural language application of the technique uses straightforward similarity measures based on verb frame statistics, but nothing in the algorithm hinges on this, and we plan in future work to elaborate our similarity measures. Although there are several roughly analogous spectral clustering techniques in the recent literature (Meilă and Shi, 2001; Longuet-Higgins and Scott, 1990; Weiss, 1999), we use the algorithm from (Ng et al., 2002) because it is simple to implement and understand.</Paragraph>
    <Paragraph position="2"> Here are the key steps of that algorithm: Given a set of points S = {s_1, ..., s_n} in a high-dimensional space.</Paragraph>
    <Paragraph position="3"> 1. Form a distance matrix D ∈ R^{n×n}. For (Ng et al., 2002) this distance measure is Euclidean, but other measures also make sense.</Paragraph>
    <Paragraph position="4"> 2. Transform the distance matrix to an affinity matrix by A_ij = exp(-D_ij^2 / 2σ^2) for i ≠ j, with A_ii = 0. The free parameter σ controls the rate at which affinity drops off with distance.</Paragraph>
    <Paragraph position="6"> 3. Form the diagonal matrix D whose (i,i) element is the sum of A's ith row, and create the matrix L = D^(-1/2) A D^(-1/2).</Paragraph>
    <Paragraph position="7"> 4. Obtain the eigenvectors and eigenvalues of L.</Paragraph>
    <Paragraph position="8"> 5. Form a new matrix from the vectors associated with the k largest eigenvalues. Choose k either by stipulation or by picking sufficient eigenvectors to cover 95% of the variance. 6. Each item now has a vector of k co-ordinates in the transformed space. Normalize these vectors to unit length. 7. Cluster in k-dimensional space. Following (Ng et al., 2002) we use k-Means for this purpose, but any other algorithm that produces tight clusters could fill the same role. In (Ng et al., 2002) an analysis demonstrates that there are likely to be k well-separated clusters. A sketch of these steps in code follows the list.</Paragraph>
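The following is a compact sketch of steps 1 to 7, not the authors' implementation. It assumes a precomputed distance matrix, uses scikit-learn's KMeans for step 7, and reads "95% of the variance" as 95% of the cumulative eigenvalue mass, which is one plausible interpretation of step 5.

```python
import numpy as np
from sklearn.cluster import KMeans  # step 7: any tight-cluster algorithm would do

def spectral_cluster(D, sigma, k=None, var_threshold=0.95):
    """Sketch of steps 1-7. D is a precomputed n-by-n distance matrix (step 1)."""
    # Step 2: distance -> affinity; sigma controls how fast affinity decays.
    A = np.exp(-(D ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # Step 3: L = Dg^{-1/2} A Dg^{-1/2}, with Dg the diagonal row-sum matrix.
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = A * np.outer(d_inv_sqrt, d_inv_sqrt)

    # Step 4: eigenvalues/eigenvectors of the symmetric matrix L.
    eigvals, eigvecs = np.linalg.eigh(L)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 5: k by stipulation, or enough eigenvectors for 95% of the
    # cumulative eigenvalue mass (one reading of "95% of the variance").
    if k is None:
        cum = np.cumsum(np.abs(eigvals)) / np.sum(np.abs(eigvals))
        k = int(np.searchsorted(cum, var_threshold)) + 1

    # Step 6: each item gets k coordinates; normalize them to unit length.
    Y = eigvecs[:, :k]
    Y = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)

    # Step 7: cluster in the transformed k-dimensional space.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Y)
    return labels, k
```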
    <Paragraph position="9"> We carry out the whole procedure for a range of values of σ. In our experiments σ is searched in steps of 0.001 from 0.01 to 0.059, since that always sufficed to find the best aligned set of clusters. If σ is set too low no useful eigenvectors are returned, but this situation is easy to detect. We take the solution with the best alignment (see definition below) to the (original) distance measure. This is how (Christianini et al., 2002) choose the best solution, while (Ng et al., 2002) explain that they choose the solution with the tightest clusters, without being specific about how this is done.</Paragraph>
    <Paragraph position="10"> In general it matters how initialization of cluster centers is done for algorithms like k-Means. (Ng et al., 2002) provide a neat initialization strategy, based on the expectation that the clusters in their space will be orthogonal. They select the first cluster center to be a randomly chosen data point, then search the remaining data points for the one most orthogonal to that. For the third data point they look for one that is most orthogonal to the previous two, and so on until enough centers have been obtained. We modify this strategy slightly, removing the random component by initializing n times, starting out at each data point in turn. This is fairly costly, but improves results, and is less expensive than the random initializations</Paragraph>
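Below is a sketch of this modified initialization, under the assumption that "most orthogonal to the previous centers" means minimizing the largest absolute dot product with the centers chosen so far; how the best of the n deterministic starts is selected is left as a callback (e.g. alignment with the distance measure), since the text does not fix it.

```python
import numpy as np

def orthogonal_init(Y, k, start):
    """Greedy orthogonal seeding: start from row `start` of the (unit-normalized)
    matrix Y of transformed coordinates, then repeatedly add the data point
    most orthogonal to the centers chosen so far (smallest worst-case |dot|)."""
    centers = [start]
    for _ in range(k - 1):
        sims = np.abs(Y @ Y[centers].T)        # |cosine| to each chosen center
        worst = sims.max(axis=1)               # similarity to the closest center
        worst[centers] = np.inf                # never re-pick a chosen point
        centers.append(int(np.argmin(worst)))  # most orthogonal remaining point
    return Y[centers]

def best_start(Y, k, score_of_init):
    """The deterministic modification: try every data point as the first center
    and keep the start whose clustering scores best under `score_of_init`
    (e.g. run k-Means from that initialization and measure alignment)."""
    best_init, best_score = None, -np.inf
    for start in range(len(Y)):
        init = orthogonal_init(Y, k, start)
        s = score_of_init(init)
        if s > best_score:
            best_init, best_score = init, s
    return best_init
```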
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
5 Experiments and evaluation
</SectionTitle>
    <Paragraph position="0"> We clustered the verb frames data using our version of the algorithm in (Ng et al., 2002). To calculate the distance d between two verbs v1 and v2 we used a range of measures: the cosine of the angle between the two vectors of frame probabilities, a flattened version of the cosine measure in which all non-zero counts are replaced by 1.0 (labelled bcos, for binarized cosine, in Table 3), and skew divergence, recently shown to be an effective measure for distributional similarity (Lee, 2001). This last is defined in terms of KL-divergence, and includes a free weight parameter w, which we set to 0.9, following (Lee, 2001). Skew divergence is asymmetric in its arguments, but our technique needs a symmetric measure, so we calculate it in both directions and use the larger value.</Paragraph>
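A sketch of the symmetrized skew divergence as used here; the direction of mixing follows Lee's (2001) definition of the asymmetric measure, and w defaults to 0.9 as in the text.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) over discrete distributions (zero entries of p ignored)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def skew_divergence(q, r, w=0.9):
    """s_w(q, r) = KL(r || w*q + (1 - w)*r): mixing q with r avoids zero denominators."""
    q, r = np.asarray(q, float), np.asarray(r, float)
    return kl(r, w * q + (1 - w) * r)

def symmetric_skew(p1, p2, w=0.9):
    """Symmetrized version used here: compute both directions, keep the larger value."""
    return max(skew_divergence(p1, p2, w), skew_divergence(p2, p1, w))
```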
    <Paragraph position="1"> Table 3 contains four results for each of three distance measures (cos, bcos and skew). The first line of each set gives the results when the spectral algorithm is provided with the prior knowledge that k = 14. The second line gives the results when the standard k-Means algorithm is used, again with k = 14. In the third line of each set, the value of k is determined from the eigenvalues, as described above. For cos, 12 clusters are chosen, for bcos the chosen value is 17, and for skew it is 16. The final line of each set gives the results when the standard algorithm is used, but k is set to the value selected for that distance measure by the spectral method.</Paragraph>
    <Paragraph position="2"> For standard k-Means, the initialization strategy from (Ng et al., 2002) does not apply (and does not work well in any case), so we used 100 random replications of the initialization, each time initializing the cluster centers with k randomly chosen data points. We report the result that had the highest alignment with the distance measure (cf. Section 5.1).</Paragraph>
    <Paragraph position="3"> (Meilă and Shi, 2001) provide analysis indicating that their MNcut algorithm (another spectral clustering technique) will be exact when the eigenvectors used for clustering are piecewise constant. Figure 1 shows the top 16 eigenvectors of a distance matrix based on skew divergence, with the items sorted by the first eigenvector. Most of the eigenvectors appear to be piecewise constant, suggesting that the conditions for good performance in clustering are indeed present in the language data. Many of the eigenvectors appear to correspond to a partition of the data into a small number of tight clusters. Taken as a whole they induce the clusterings reported in Table 3.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.1 Alignment as an evaluation tool
</SectionTitle>
      <Paragraph position="0"> Pearson correlation between corresponding elements of the Gram matrices has been suggested as a measure of agreement between a clustering and a distance measure (Christianini et al., 2002). Since we can convert a clustering into a distance measure, alignment can be used in a number of ways, including comparison of clusterings against each other.</Paragraph>
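A sketch of alignment as just described, assuming one simple way of converting a clustering into a distance-like Gram matrix (0 within a cluster, 1 across clusters); the paper does not spell out that encoding, nor whether diagonal elements enter the correlation. The variable names in the usage comment (gold, inferred, D_skew) are placeholders, not names from the paper.

```python
import numpy as np

def clustering_to_distances(labels):
    """Turn a clustering into a distance-like Gram matrix: 0 for items in the
    same cluster, 1 otherwise (one simple encoding; the exact one is not given)."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

def alignment(M1, M2):
    """Pearson correlation between corresponding off-diagonal elements of two
    symmetric Gram matrices."""
    iu = np.triu_indices_from(M1, k=1)
    return float(np.corrcoef(M1[iu], M2[iu])[0, 1])

# Hypothetical usage, with `gold`, `inferred` (label arrays) and `D_skew`
# (a skew-divergence Gram matrix) standing in for real data:
#   support    = alignment(clustering_to_distances(gold),     D_skew)
#   confidence = alignment(clustering_to_distances(inferred), D_skew)
#   quality    = alignment(clustering_to_distances(gold),
#                          clustering_to_distances(inferred))
```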
      <Paragraph position="1"> For evaluation, three alignment-based measures are particularly relevant: The alignment between the gold standard and the distance measure reflects the presence or absence in the distance measure of evidential support for the relationships that the clustering algorithm is supposed to infer. This is the column labelled "Support" in Table 3.</Paragraph>
      <Paragraph position="2"> The alignment between the clusters inferred by the algorithm and the distance measure reflects the confidence that the algorithm has in the relationships that it has chosen. This is the column labelled "Confidence" in Table 3.</Paragraph>
      <Paragraph position="3"> The alignment between the gold standard and the inferred clusters reflects the quality of the result. This is the column labelled "Quality" in Table 3.</Paragraph>
      <Paragraph position="4"> We hope that when the algorithms are confident they will also be right, and that when the data strongly supports a distinction the algorithms will find it.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> Table 3 contains our data. The columns based on various forms of alignment have been discussed above. Clusterings are also sets of pairs, so, when the Gram matrices are discrete, we can also provide the standard measures of precision, recall and F-measure. Usually it is irrelevant whether we choose alignment or the standard measures, but the latter can yield unexpected results for extreme clusterings (many small clusters or few very big clusters). The remaining columns provide these conventional performance measures.</Paragraph>
      <Paragraph position="1"> For all the evaluation methods and all the distance measures that we have tried, the algorithm from (Ng et al., 2002) does better than direct clustering, usually finding a clustering that aligns better with the distance measure than does the gold standard. Deficiencies in the result are due to weaknesses in the distance measures or the original count data, rather than search errors committed by the clustering algorithm. Skew divergence is the best distance measure, cosine is less good and cosine on binarized data the worst.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.3 Which verbs and clusters are hard?
</SectionTitle>
      <Paragraph position="0"> All three alignment measures can be applied to a clustering as a whole, as above, or restricted to a subset of the Gram matrix. These can tell us how well each verb and each cluster matches the distance measure (or indeed the gold standard). To compute alignment for a verb we calculate it over the corresponding row of the Gram matrix. For a cluster we do the same, but over all the rows corresponding to the cluster members. The second column of Table 4, labelled "Support", gives the contribution of that verb to the alignment between the gold standard clustering and the skew-divergence distance measure (that is, the empirical support that the distance measure gives to the human-preferred placement of the verb). The third column, labelled "Confidence", contains the contribution of the verb to the alignment between the skew-divergence measure and the clustering inferred by our algorithm (this is the measure of the confidence that the clustering algorithm has in the correctness of its placement of the verb, and is what is maximized by Ng's algorithm as we vary σ). The fourth column, labelled "Correctness", measures the contribution of the verb to the alignment between the inferred cluster and the gold standard (this is the measure of how correctly the verb was placed). To get a feel for performance at the cluster level we measured the alignment with the gold standard.</Paragraph>
      <Paragraph position="1"> We merged and ranked the lists proposed by skew divergence and binary cosine. The figure of merit, labelled "Score", is the geometric mean of the alignments for the members of the cluster. The second column, labelled "Method", indicates which distance measure or measures produced this cluster. Table 5 shows this ranking. Two highly ranked clusters (Emotion and a large subset of Weather) are selected by both distance measures. The highest ranked cluster proposed only by binary cosine is a subset of Position, but this is dominated by skew divergence's correct identification of the whole class (see Table 2 for a reminder of the definitions of these classes). The systematic superiority of the probabilistic measure suggests that there is after all useful information about verb classes in the non-categorical part of our verb frame data.</Paragraph>
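A sketch of the row-restricted alignment and the cluster "Score" described above; whether diagonal entries are excluded is not stated, so this version keeps full rows, and the geometric mean assumes positive alignment values.

```python
import numpy as np

def row_alignment(M1, M2, rows):
    """Alignment restricted to the given rows of two Gram matrices:
    Pearson correlation over just those entries (full rows kept here)."""
    rows = np.asarray(rows)
    return float(np.corrcoef(M1[rows].ravel(), M2[rows].ravel())[0, 1])

def verb_contribution(M1, M2, verb_index):
    """Per-verb contribution: the correlation computed over that verb's row alone."""
    return row_alignment(M1, M2, [verb_index])

def cluster_score(member_alignments):
    """The 'Score' figure of merit: geometric mean of the members' alignments
    (assumes the values are positive)."""
    vals = np.asarray(member_alignments, dtype=float)
    return float(np.exp(np.mean(np.log(vals))))
```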
    </Section>
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
6 Related work
</SectionTitle>
    <Paragraph position="0"> Levin's (Levin, 1993) classification has provoked several studies that aim to acquire lexical semantic information from corpora using cues pertaining mainly to syntactic structure (..., 2000; Lapata, 1999; McCarthy, 2000; Lapata and Brew, 1999). Other work has used Levin's list of verbs (in conjunction with related lexical resources) for the creation of dictionaries that exploit the systematic correspondence between syntax and meaning (Dorr, 1997; Dang et al., 1997; Dorr and Jones, 1996).</Paragraph>
    <Paragraph position="1"> Most statistical approaches, including ours, treat verbal meaning assignment as a semantic clustering or classification task. The underlying question is the following: how can corpus information be exploited in deriving the semantic class for a given verb? Despite the unifying theme of using corpora and corpus distributions for the acquisition task, the approaches differ in the inventory of classes they employ, in the methodology used for inferring semantic classes and in the specific assumptions concerning the verbs to be classified (i.e., can they be polysemous or not).</Paragraph>
    <Paragraph position="2"> (Merlo and Stevenson, 2001) use grammatical features (acquired from corpora) to classify verbs into three semantic classes: unergative, unaccusative, and object-drop. These classes are abstractions of Levin's (Levin, 1993) classes and as a result yield a coarser classification.</Paragraph>
    <Paragraph position="3"> The classifier used is a decision tree learner.</Paragraph>
    <Paragraph position="4"> (Schulte im Walde, 2000) uses subcategorization information and selectional restrictions to cluster verbs into (Levin, 1993) compatible semantic classes. Subcategorization frames are induced from the BNC using a robust statistical parser (Carroll and Rooth, 1998). The selectional restrictions are acquired using Resnik's (Resnik, 1993) information-theoretic measure of selectional association which combines distributional and taxonomic information in order to formalise how well a predicate associates with a given argument.</Paragraph>
  </Section>
</Paper>