<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1009"> <Title>Clustering Polysemic Subcategorization Frame Distributions Semantically</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Subcategorization Information </SectionTitle> <Paragraph position="0"> We obtain our SCF data using the subcategorization acquisition system of Briscoe and Carroll (1997).</Paragraph> <Paragraph position="1"> We expect the use of this system to be beneficial: it employs a robust statistical parser (Briscoe and Carroll, 2002) which yields complete though shallow parses, and a comprehensive SCF classifier, which incorporates 163 SCF distinctions, a superset of those found in the ANLT (Boguraev et al., 1987) and COMLEX (Grishman et al., 1994) dictionaries. The SCFs abstract over specific lexicallygoverned particles and prepositions and specific predicate selectional preferences but include some derived semi-predictable bounded dependency constructions, such as particle and dative movement.</Paragraph> <Paragraph position="2"> 78 of these 'coarse-grained' SCFs appeared in our data. In addition, a set of 160 fine grained frames were employed. These were obtained by parameterizing two high frequency SCFs for prepositions: the simple PP and NP + PP frames. The scope was restricted to these two frames to prevent sparse data problems in clustering.</Paragraph> <Paragraph position="3"> A SCF lexicon was acquired using this system from the British National Corpus (Leech, 1992,</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> TEST GOLD STANDARD TEST GOLD STANDARD TEST GOLD STANDARD TEST GOLD STANDARD VERB CLASSES VERB CLASSES VERB CLASSES VERB CLASSES </SectionTitle> <Paragraph position="0"> place 9 dye 24, 21, 41 focus 31, 45 stare 30 lay 9 build 26, 45 force 002, 11 glow 43 drop 9, 45, 004, 47, bake 26, 45 persuade 002 sparkle 43 51, A54, A30 pour 9, 43, 26, 57, 13, 31 invent 26, 27 urge 002, 37 dry 45 load 9 publish 26, 25 want 002, 005, 29, 32 shut 45 settle 9, 46, A16, 36, 55 cause 27, 002 need 002, 005, 29, 32 hang 47, 9, 42, 40 fill 9, 45, 47 generate 27, 13, 26 grasp 30, 15 sit 47, 9 remove 10, 11, 42 induce 27, 002, 26 understand 30 disappear 48 withdraw 10, A30 acknowledge 29, A25, A35 conceive 30, 29, A56 vanish 48 wipe 10, 9 proclaim 29, 37, A25 consider 30, 29 march 51 brush 10, 9, 41, 18 remember 29, 30 perceive 30 walk 51 filter 10 imagine 29, 30 analyse 34, 35 travel 51 send 11, A55 specify 29 evaluate 34, 35 hurry 53, 51 ship 11, A58 establish 29, A56 explore 35, 34 rush 53, 51 transport 11, 31 suppose 29, 37 investigate 35, 34 begin 55 carry 11, 54 assume 29, A35, A57 agree 36, 22, A42 continue 55, 47, 51 drag 11, 35, 51, 002 think 29, 005 communicate 36, 11 snow 57, 002 push 11, 12, 23, 9, 002 confirm 29 shout 37 rain 57 pull 11, 12, 13, 23, 40, 016 believe 29, 31, 33 whisper 37 sin 003 give 13 admit 29, 024, 045, 37 talk 37 rebel 003 lend 13 allow 29, 024, 13, 002 speak 37 risk 008, A7 study 14, 30, 34, 35 act 29 say 37, 002 gamble 008, 009 hit 18, 17, 47, A56, 31, 42 behave 29 mention 37 beg 015, 32 bang 18, 43, 9, 47, 36 feel 30, 31, 35, 29 eat 39 pray 015, 32 carve 21, 25, 26 see 30, 29 drink 39 seem 020 add 22, 37, A56 hear 30, A32 laugh 40, 37 appear 020, 48, 29 mix 22, 26, 36 notice 30, A32 smile 40, 37 colour 24, 31, 45 concentrate 31, 45 look 30, 35 used per test verb. 
<Paragraph position="1"> The lexicon was evaluated against manually analysed corpus data after an empirically defined threshold of 0.025 was set on relative frequencies of SCFs to remove noisy SCFs. The method yielded 71.8% precision and 34.5% recall. When we removed the filtering threshold and evaluated the noisy distribution, F-measure4 dropped from 44.9 to</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Clustering Method </SectionTitle> <Paragraph position="0"> Data clustering is a process which aims to partition a given set into subsets (clusters) of elements that are similar to one another, while ensuring that elements that are not similar are assigned to different clusters.</Paragraph> <Paragraph position="1"> We use clustering for partitioning a set of verbs. Our hypothesis is that information about SCFs and their associated frequencies is relevant for identifying semantically related verbs. Hence, we use SCFs as relevance features to guide the clustering process.6</Paragraph> <Paragraph position="2"> We chose two clustering methods which do not involve task-oriented tuning (such as pre-fixed thresholds or restricted cluster sizes) and which approach the data straightforwardly, in its distributional form: (i) a simple hard method that collects the nearest neighbours (NN) of each verb (figure 1), and (ii) the Information Bottleneck (IB), an iterative soft method (Tishby et al., 1999) based on information-theoretic grounds.</Paragraph> <Paragraph position="3"> 5These figures are not particularly impressive because our evaluation is exceptionally hard. We use 1) highly polysemic test verbs, 2) a high number of SCFs and 3) evaluate against manually analysed data rather than dictionaries (the latter have high precision but low recall).</Paragraph> <Paragraph position="4"> 6The relevance of the features to the task is evident when comparing the probability of a randomly chosen pair of verbs verb_i and verb_j to share the same predominant sense (4.5%) with the probability obtained when verb_j is the JS-divergence nearest neighbour of verb_i (36%).</Paragraph> <Paragraph position="5"> The NN method is very simple, but it has some disadvantages. It outputs only one clustering configuration, and therefore does not allow examination of different cluster granularities. It is also highly sensitive to noise. A few exceptional neighbourhood relations contradicting the typical trends in the data are enough to cause the formation of a single cluster which encompasses all elements.</Paragraph> <Paragraph position="6"> Therefore we employed the more sophisticated IB method as well. The IB quantifies the relevance information of a SCF distribution with respect to output clusters through their mutual information I(Clusters; SCFs). The relevance information is maximized, while the compression information I(Clusters; Verbs) is minimized. This ensures optimal compression of the data through the clusters.</Paragraph> <Paragraph position="8"> NN Clustering: 1. For each verb v: 2. Calculate the JS divergence between the SCF distributions of v and all other verbs: JS(p, q) = 1/2 [ D(p || (p+q)/2) + D(q || (p+q)/2) ], where D is the Kullback-Leibler distance; 3. Connect v with the most similar verb; 4. Find all the connected components. Figure 1: NN clustering.</Paragraph>
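To make Figure 1 concrete, the following is a minimal sketch (ours, not the authors' implementation) of NN clustering over SCF distributions. It assumes each verb's SCF distribution is a dictionary mapping SCF identifiers to relative frequencies, and it treats the connected components of the resulting nearest-neighbour graph as the output clusters.

```python
import math
from collections import defaultdict

def kl(p, q):
    # Kullback-Leibler distance D(p || q); q must be non-zero wherever p is
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)

def js(p, q):
    # Jensen-Shannon divergence between two SCF distributions (dicts of
    # relative frequencies), built from D as in Figure 1
    keys = set(p) | set(q)
    pf = {s: p.get(s, 0.0) for s in keys}
    qf = {s: q.get(s, 0.0) for s in keys}
    m = {s: 0.5 * (pf[s] + qf[s]) for s in keys}
    return 0.5 * kl(pf, m) + 0.5 * kl(qf, m)

def nn_clusters(scf_dists):
    # scf_dists: {verb: {scf: relative frequency}}
    verbs = list(scf_dists)
    # steps 1-3: connect each verb with its most similar verb under JS
    edges = defaultdict(set)
    for v in verbs:
        nearest = min((u for u in verbs if u != v),
                      key=lambda u: js(scf_dists[v], scf_dists[u]))
        edges[v].add(nearest)
        edges[nearest].add(v)
    # step 4: the clusters are the connected components of this graph
    clusters, seen = [], set()
    for v in verbs:
        if v in seen:
            continue
        component, stack = set(), [v]
        while stack:
            w = stack.pop()
            if w not in component:
                component.add(w)
                stack.extend(edges[w] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

As noted above, a few exceptional nearest-neighbour links can connect otherwise distinct components into a single large cluster, which is why the method is sensitive to noise.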
<Paragraph position="7"> The tradeoff between the two constraints is realized through minimizing the cost term: L = I(Clusters; Verbs) - b * I(Clusters; SCFs), where b is a parameter that balances the constraints.</Paragraph> <Paragraph position="12"> The IB iterative algorithm finds a local minimum of the above cost term. It takes three inputs: (i) SCF-verb distributions, (ii) the desired number of clusters K, and (iii) the value of b.</Paragraph> <Paragraph position="13"> Starting from a random configuration, the algorithm repeatedly calculates, for each cluster K, verb V and SCF S, the following probabilities: (i) the marginal proportion of the cluster p(K); (ii) the probability p(S|K) for a SCF to occur with members of the cluster; and (iii) the probability p(K|V) for a verb to be assigned to the cluster. These probabilities are used, each in its turn, for calculating the other probabilities (figure 2). The collection of all p(S|K)'s for a fixed cluster K can be regarded as a probabilistic center (centroid) of that cluster in the SCF space.</Paragraph> <Paragraph position="14"> The IB method gives an indication of the most informative values of K.7 Intensifying the weight b attached to the relevance information I(Clusters; SCFs) allows us to increase the number K of distinct clusters being produced (while too small a value of b would cause some of the output clusters to be identical to one another). Hence, the relevance information grows with K. Accordingly, we consider as the most informative output configurations those for which the relevance information increases more sharply between K-1 and K clusters than between K and K+1.</Paragraph> <Paragraph position="15"> 7Most works on clustering ignore this issue and refer to an arbitrarily chosen number of clusters, or to the number of gold standard classes, which cannot be assumed in realistic applications.</Paragraph> <Paragraph position="17"> When the weight of relevance grows, the assignment to clusters is more constrained and p(K|V) becomes more similar to hard clustering. Let K(V) = argmax_K p(K|V) denote the most probable cluster of a verb V.</Paragraph> <Paragraph position="20"> For K >= 30, more than 85% of the verbs have p(K(V)|V) > 90%, which makes the output clustering approximately hard. For this reason, we decided to use only K(V) as output and defer a further exploration of the soft output to future work.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experimental Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Data </SectionTitle> <Paragraph position="0"> The input data to clustering was obtained from the automatically acquired SCF lexicon for our 110 test verbs (section 2). The counts were extracted from unfiltered (noisy) SCF distributions in this lexicon.8 The NN algorithm produced 24 clusters on this input. From the IB algorithm, we requested K = 2 to 60 clusters. The upper limit was chosen so as to slightly exceed the case when the average cluster size 110/K = 2. We chose for evaluation the IB results for K = 25, 35 and 42. For these values, the SCF relevance satisfies our criterion for a notable improvement in cluster quality (section 4).</Paragraph> <Paragraph position="1"> The value K = 35 is very close to the actual number (34) of predominant senses in the gold standard. In this way, the IB yields structural information beyond clustering.</Paragraph>
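The iterative IB procedure described in section 4 can be sketched as follows. This is a schematic illustration based on the self-consistent update equations of Tishby et al. (1999), not the paper's own implementation; the uniform verb prior, the fixed iteration count and the small numerical-stability constant are our assumptions.

```python
import math
import random

def kl(p, q, eps=1e-12):
    # Kullback-Leibler distance D(p || q) over aligned lists of SCF probabilities
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def ib_cluster(p_s_given_v, K, beta, iters=100, seed=0):
    # p_s_given_v: one SCF distribution per verb (list of lists, rows sum to 1)
    # Returns the soft assignment p(K|V) for every verb.
    rng = random.Random(seed)
    n_verbs, n_scfs = len(p_s_given_v), len(p_s_given_v[0])
    p_v = 1.0 / n_verbs                                  # uniform verb prior (assumption)
    p_k_given_v = []
    for _ in range(n_verbs):                             # random initial configuration
        weights = [rng.random() for _ in range(K)]
        total = sum(weights)
        p_k_given_v.append([w / total for w in weights])
    for _ in range(iters):
        # (i) marginal proportion of each cluster, p(K)
        p_k = [p_v * sum(p_k_given_v[v][k] for v in range(n_verbs)) for k in range(K)]
        # (ii) cluster centroids p(S|K) in the SCF space
        p_s_given_k = [[0.0] * n_scfs for _ in range(K)]
        for k in range(K):
            for v in range(n_verbs):
                weight = p_v * p_k_given_v[v][k] / max(p_k[k], 1e-12)
                for s in range(n_scfs):
                    p_s_given_k[k][s] += weight * p_s_given_v[v][s]
        # (iii) re-estimate p(K|V): proportional to p(K) * exp(-beta * D(p(S|V) || p(S|K)))
        for v in range(n_verbs):
            logits = [math.log(max(p_k[k], 1e-12))
                      - beta * kl(p_s_given_v[v], p_s_given_k[k]) for k in range(K)]
            m = max(logits)
            exps = [math.exp(x - m) for x in logits]
            z = sum(exps)
            p_k_given_v[v] = [x / z for x in exps]
    return p_k_given_v

def hard_clusters(p_k_given_v):
    # K(V) = argmax_K p(K|V): the most probable cluster of each verb
    return [max(range(len(row)), key=row.__getitem__) for row in p_k_given_v]
```

Larger values of beta (the weight b above) make p(K|V) increasingly peaked, which is why the output for the K values used here is treated as approximately hard.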
<Paragraph position="2"> 8This yielded better results, which might indicate that the unfiltered "noisy" SCFs contain information which is valuable for the task.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Method </SectionTitle> <Paragraph position="0"> A number of different strategies have been proposed for the evaluation of clustering. We concentrate here on those which deliver a numerical value that is easy to interpret, and which do not introduce biases towards specific numbers of classes or class sizes. As we currently assign a single sense to each polysemic verb (sec. 5.4), the measures we use are also applicable for evaluation against a polysemous gold standard.</Paragraph> <Paragraph position="1"> Our first measure, the adjusted pairwise precision (APP), evaluates clusters in terms of verb pairs (Schulte im Walde and Brew, 2002)9:</Paragraph> <Paragraph position="3"> APP = 1/|K| * Σ_i [ (num. of correct pairs in k_i) / (num. of pairs in k_i) ] * [ (|k_i| - 1) / (|k_i| + 1) ]</Paragraph> <Paragraph position="5"> APP is the average proportion of all within-cluster pairs that are correctly co-assigned. It is multiplied by a factor that increases with cluster size. This factor compensates for a bias towards small clusters.</Paragraph> <Paragraph position="6"> Our second measure is derived from purity, a global measure which evaluates the mean precision of the clusters, weighted according to the cluster size (Stevenson and Joanis, 2003). We associate with each cluster its most prevalent semantic class, and denote the number of verbs in a cluster K that take its prevalent class by nprevalent(K). Verbs that do not take this class are considered as errors. Given our task, we are only interested in classes which contain two or more verbs. We therefore disregard those clusters where nprevalent(K) = 1. This leads us to define modified purity:</Paragraph> <Paragraph position="8"> mPUR = ( Σ_{k_i : nprevalent(k_i) >= 2} nprevalent(k_i) ) / (number of verbs).</Paragraph> <Paragraph position="9"> The modification we introduce to purity removes the bias towards the trivial configuration comprised of only singletons. (A short computational sketch of both measures is given below.)</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Evaluation Against the Predominant Sense </SectionTitle> <Paragraph position="0"> We first evaluated the clusters against the predominant sense, i.e. using the monosemous gold standard. The results, shown in Table 2, demonstrate that both methods perform better on the task than our random clustering baseline. (Table 2 reports results with and without prepositions; its last entry presents the performance of random clustering with K = 25, which yielded the best results among the three values K = 25, 35 and 42.)</Paragraph> <Paragraph position="1"> Both methods show clearly better performance with fine-grained SCFs (with prepositions, +PP) than with coarse-grained ones (-PP).</Paragraph> <Paragraph position="2"> Surprisingly, the simple NN method performs very similarly to the more sophisticated IB. Being based on pairwise similarities, it shows better performance than the IB on the pairwise measure. The IB is, however, slightly better according to the global measure (by 2% with K = 42). The fact that the NN method performs better than the IB with similar K values (NN K = 24 vs. IB K = 25) seems to suggest that the JS divergence provides a better model for the predominant class than the compression model of the IB. However, it is likely that the IB performance suffered due to our choice of test data.
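As promised in section 5.2, here is a minimal sketch of how the two evaluation measures can be computed from hard cluster assignments. The helper names and the handling of singleton clusters are ours, and the gold standard is assumed to give a single (assigned) sense per verb.

```python
from collections import Counter
from itertools import combinations

def app(clusters, gold):
    # Adjusted pairwise precision: for each cluster, the proportion of correctly
    # co-assigned verb pairs, weighted by the size factor (|k|-1)/(|k|+1),
    # averaged over all clusters.
    total = 0.0
    for k in clusters:
        members = list(k)
        if len(members) < 2:
            continue                      # a singleton's size factor is 0 anyway
        pairs = list(combinations(members, 2))
        correct = sum(1 for a, b in pairs if gold[a] == gold[b])
        total += (correct / len(pairs)) * (len(members) - 1) / (len(members) + 1)
    return total / len(clusters)

def mpur(clusters, gold):
    # Modified purity: sum of prevalent-class counts over clusters whose
    # prevalent class covers at least two verbs, divided by the number of verbs.
    n_verbs = sum(len(k) for k in clusters)
    prevalent = (Counter(gold[v] for v in k).most_common(1)[0][1] for k in clusters)
    return sum(n for n in prevalent if n >= 2) / n_verbs
```

For the polysemic evaluation in section 5.4, gold[v] would hold the single sense assigned to each verb within its cluster.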
As the IB method is global, it performs better when the target classes are represented by a high number of verbs.</Paragraph> <Paragraph position="3"> In our experiment, many semantic classes were represented by two verbs only (section 2).</Paragraph> <Paragraph position="4"> Nevertheless, the IB method has the clear advantage that it allows for more clusters to be produced. At best it classified half of the verbs correctly according to their predominant sense (mPUR = 50%).</Paragraph> <Paragraph position="5"> Although this leaves room for improvement, the result compares favourably to previously published results10. We argue, however, that evaluation against a monosemous gold standard reveals only part of the picture.</Paragraph> <Paragraph position="6"> 10Due to differences in task definition and experimental setup, a direct comparison with earlier results is impossible. For example, Stevenson and Joanis (2003) report an accuracy of 29% (which implies mPUR of roughly 29%), but their task involves classifying 841 verbs into 14 classes based on differences in the predicate-argument structure.</Paragraph> <Paragraph position="7"> (Table 3: evaluation per K against the predominant sense and against multiple senses; additional figures are results of evaluation on randomly polysemous data + significance of the actual figure. Results were obtained with fine-grained SCFs, including prepositions.)</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Evaluation Against Multiple Senses </SectionTitle> <Paragraph position="0"> In evaluation against the polysemic gold standard, we assume that a verb which is polysemous in our corpus data may appear in a cluster with verbs that share any of its senses. In order to evaluate the clusters against polysemous data, we assigned each polysemic verb V a single sense: the one it shares with the highest number of verbs in the cluster K(V).</Paragraph> <Paragraph position="1"> Table 3 shows the results against the polysemic and monosemous gold standards. The former are noticeably better than the latter (e.g. the IB with K = 42 is 9% better). Clearly, allowing for multiple gold standard classes makes it easier to obtain better results in evaluation.</Paragraph> <Paragraph position="2"> In order to show that polysemy makes a non-trivial contribution in shaping the clusters, we measured the improvement that can be due to pure chance by creating randomly polysemous gold standards. We constructed 100 sets of random gold standards. In each iteration, the verbs kept their original predominant senses, but the set of additional senses was taken entirely from another verb, chosen at random. By doing so, we preserved the dominant sense of each verb, the total frequency of all senses and the correlations between the additional senses.</Paragraph> <Paragraph position="3"> The results included in Table 3 indicate, with 99.5% confidence (3σ and above), that the improvement obtained with the polysemous gold standard is not artificial (except in two cases with 95% confidence).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.5 Qualitative Analysis of Polysemy </SectionTitle> <Paragraph position="0"> We performed qualitative analysis to further investigate the effect of polysemy on clustering performance.
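Before turning to the qualitative results, the two procedures described in section 5.4 (assigning each polysemous verb a single sense, and building randomly polysemous gold standards) can be sketched as follows. This is our own illustration with hypothetical variable names; it assumes each verb's gold senses are listed with the predominant sense first.

```python
import random
from collections import Counter

def assign_single_sense(cluster, senses):
    # Assign each verb the sense it shares with the highest number of other
    # verbs in its cluster K(V); senses: {verb: [classes], predominant first}.
    assignment = {}
    for v in cluster:
        counts = Counter(c for u in cluster if u != v for c in senses[u])
        assignment[v] = max(senses[v], key=lambda c: counts.get(c, 0))
    return assignment

def random_gold_standard(senses, rng):
    # Keep every verb's predominant sense, but take its additional senses
    # wholesale from another verb chosen at random.
    verbs = list(senses)
    return {v: [senses[v][0]] + senses[rng.choice([u for u in verbs if u != v])][1:]
            for v in verbs}

def z_score(actual, random_scores):
    # How many standard deviations the actual evaluation figure lies above
    # the scores obtained on the randomized gold standards.
    n = len(random_scores)
    mean = sum(random_scores) / n
    std = (sum((x - mean) ** 2 for x in random_scores) / n) ** 0.5
    return (actual - mean) / std

# usage sketch: rng = random.Random(0)
# randomized = [random_gold_standard(senses, rng) for _ in range(100)]
```

The 3σ threshold quoted in the text then corresponds to requiring z_score(actual, random_scores) >= 3.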
(Table 4: ... as a function of the number of different senses between pair members (results of the NN algorithm). Table 5: ... as a function of the number of shared senses (results of the NN algorithm).)
The results in table 4 demonstrate that the more two verbs differ in their senses, the lower their chance of ending up in the same cluster. From the figures in table 5 we see that the probability of two verbs appearing in the same cluster increases with the number of senses they share. Interestingly, it is not only the degree of polysemy which influences the results, but also the type. For verb pairs where at least one of the members displays 'irregular' polysemy (i.e. it does not share its full set of senses with any other verb), the probability of co-occurrence in the same cluster is far lower than for verbs which are polysemic in a 'regular' manner (Table 5).</Paragraph> <Paragraph position="1"> Manual cluster analysis against the polysemic gold standard revealed a yet more comprehensive picture. Consider the following clusters (the IB output with K = 42):
A1: talk (37), speak (37)
A2: look (30, 35), stare (30)
A3: focus (31, 45), concentrate (31, 45)
A4: add (22, 37, A56)
We identified a close relation between the clustering performance and the following patterns of semantic behaviour: 1) Monosemy: We had 32 monosemous test verbs. 10 gold standard classes included 2 or more of these. 7 classes were correctly acquired using clustering (e.g. A1), indicating that clustering monosemous verbs is fairly 'easy'.</Paragraph> <Paragraph position="2"> 2) Predominant sense: We examined by hand 10 clusters whose members were correctly classified together, despite one of them being polysemous (e.g. A2). In 8 cases there was a clear indication in the data (when examining SCFs and the selectional preferences on argument heads) that the polysemous verb indeed had its predominant sense in the relevant class and that the co-occurrence was not due to noise.</Paragraph> </Section> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 3) Regular Polysemy </SectionTitle> <Paragraph position="0"> Several clusters were produced which represent linguistically plausible intersective classes (e.g. A3) (Dang et al., 1998) rather than single classes.</Paragraph> <Paragraph position="1"> 4) Irregular Polysemy: Verbs with irregular polysemy11 were frequently assigned to singleton clusters. For example, add (A4) has a 'combining and attaching' sense in class 22 which involves NP and PP SCFs and another 'communication' sense in 37 which takes sentential SCFs. Irregular polysemy was not a marginal phenomenon: it explains 5 of the 10 singletons in our data. 11Recall our definition of irregular polysemy (section 5.4).</Paragraph> <Paragraph position="2"> These observations confirm that evaluation against a polysemic gold standard is necessary in order to fully explain the results from clustering.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.6 Qualitative Analysis of Errors </SectionTitle> <Paragraph position="0"> Finally, to provide feedback for further development of our verb classification approach, we performed a qualitative analysis of errors not resulting from polysemy.
Consider the following clusters (the IB output for K = 42):
B1: place (9), build (26, 45), publish (26, 25), carve (21, 25, 26)
B2: sin (003), rain (57), snow (57, 002)
B3: agree (36, 22, A42), appear (020, 48, 29), begin (55), continue (55, 47, 51)
B4: beg (015, 32)
Three main error types were identified: 1) Syntactic idiosyncrasy: This was the most frequent error type, exemplified in B1, where place is incorrectly clustered with build, publish and carve merely because it takes similar prepositions to these verbs (e.g. in, on, into).</Paragraph> <Paragraph position="1"> 2) Sparse data: Many of the low-frequency verbs (we had 12 with frequency less than 300) performed poorly. In B2, sin (which had 53 occurrences) is classified with rain and snow because it does not occur in our data with the preposition against, the 'hallmark' of its gold standard class ('Conspire Verbs').</Paragraph> <Paragraph position="2"> 3) Problems in SCF acquisition: These were not numerous, but occurred e.g. when the system could not distinguish between different control constructions (e.g. subject/object equi/raising) (B3).</Paragraph> </Section> </Section> </Paper>