<?xml version="1.0" standalone="yes"?> <Paper uid="J01-3003"> <Title>Automatic Verb Classification Based on Statistical Distributions of Argument Structure</Title> <Section position="4" start_page="379" end_page="380" type="metho"> <SectionTitle> 3. Data Collection and Analysis </SectionTitle> <Paragraph position="0"> Clearly, some of the features we've proposed are difficult (e.g., the passive use) or impossible (e.g., animate subject use) to automatically extract with high accuracy from a 2 For our sample verbs, the statistical correlation between the transitive and passive features is highly significant (N = 59, R = .44, p = .001), as is the correlation between the transitive and past participle features (N = 59, R = .36, p = .005). (Since, as explained in the next section, our features are expressed as proportions--e.g., percent transitive use out of detected transitive and intransitive use--correlations of intransitivity with passive or past participle use have the same magnitude but are negative.) The features and expected behavior.</Paragraph> <Section position="1" start_page="380" end_page="380" type="sub_section"> <SectionTitle> Feature Expected Frequency Pattern Explanation </SectionTitle> <Paragraph position="0"> Transitivity Unerg < Unacc < ObjDrop Unaccusatives and unergatives have a causative transitive, hence lower transitive use. Furthermore, unergatives have an agentive object, hence very low transitive use.</Paragraph> <Paragraph position="1"> Causativity Unerg, ObjDrop < Unacc Object-drop verbs do not have a causal agent, hence low &quot;causative&quot; use. Unergatives are rare in the transitive, hence low causative use.</Paragraph> <Paragraph position="2"> Animacy Unacc < Unerg, ObjDrop Unaccusatives have a Theme subject in the intransitive, hence lower use of animate subjects.</Paragraph> <Paragraph position="3"> Passive Voice Unerg < Unacc < ObjDrop Passive implies transitive use, hence correlated with transitive feature.</Paragraph> <Paragraph position="4"> VBN Tag Unerg < Unacc < ObjDrop Passive implies past participle use (VBN), hence correlated with transitive (and passive).</Paragraph> <Paragraph position="5"> large corpus, given the current state of annotation. However, we do assume that currently available corpora, such as the Wall Street Journal (WSJ), provide a representative, and large enough, sample of language from which to gather corpus counts that can approximate the distributional patterns of the verb class alternations. Our work draws on two text corpora--one an automatically tagged combined corpus of 65 million words (primarily WSJ), the second an automatically parsed corpus of 29 million words (a subset of the WSJ text from the first corpus). Using these corpora, we develop counting procedures that yield relative frequency distributions for approximations to the five linguistic features we have determined, over a sample of verbs from our three classes.</Paragraph> </Section> <Section position="2" start_page="380" end_page="380" type="sub_section"> <SectionTitle> 3.1 Materials and Method </SectionTitle> <Paragraph position="0"> We chose a set of 20 verbs from each class based primarily on the classification in Levin (1993). 3 The complete list of verbs appears in Table 5; the group 1/group 2 designation is explained below in the section on counting.
As indicated in the table, unergatives are manner-of-motion verbs (from the &quot;run&quot; class in Levin), unaccusatives are change-of-state verbs (from several of the classes in Levin's change-of-state super-class), while object-drop verbs were taken from a variety of classes in Levin's classification, all of which undergo the unexpressed object alternation. The most frequently used classes are verbs of change of possession, image-creation verbs, and verbs of creation and transformation. The selection of verbs was based partly on our intuitive judgment that the verbs were likely to be used with sufficient frequency in the WSJ. Also, each</Paragraph> </Section> </Section> <Section position="5" start_page="380" end_page="386" type="metho"> <SectionTitle> 3 We used an equal number of verbs from each class in order to have a balanced group of items. One </SectionTitle> <Paragraph position="0"> potential disadvantage of this decision is that each verb class is represented equally, even though they may not be equally frequent in the corpora. Although we lose the relative frequency information among the classes that could provide a better bias for assigning a default classification (i.e., the most frequent one), we have the advantage that our classifier will be equally informed (in terms of number of exemplars) about each class.</Paragraph> <Paragraph position="1"> Note that there are only 19 unaccusative verbs because ripped, which was initially counted in the unaccusatives, was then excluded from the analysis as it occurred mostly in a very different usage in the corpus (as verb+particle, in ripped off) from the intended optionally intransitive usage.</Paragraph> <Paragraph position="2"> Computational Linguistics Volume 27, Number 3 Table 5 Verbs used in the experiments.</Paragraph> <Paragraph position="3"> Class Name Description Selected Verbs Unergative manner of motion jumped, rushed, marched, leaped, floated, raced, hurried, wandered, vaulted, paraded (group 1); galloped, glided, hiked, hopped, jogged, scooted, scurried, skipped, tiptoed, trotted (group 2).</Paragraph> <Paragraph position="4"> Unaccusative change of state opened, exploded, flooded, dissolved, cracked, hardened, boiled, melted, fractured, solidified (group 1); collapsed, cooled, folded, widened, changed, cleared, divided, simmered, stabilized (group 2).</Paragraph> <Paragraph position="5"> Object-Drop unexpressed object alternation played, painted, kicked, carved, reaped, washed, danced, yelled, typed, knitted (group 1); borrowed, inherited, organized, rented, sketched, cleaned, packed, studied, swallowed, called (group 2).</Paragraph> <Paragraph position="6"> verb presents the same form in the simple past and in the past participle (the regular &quot;-ed&quot; form). In order to simplify the counting procedure, we included only the &quot;-ed&quot; form of the verb, on the assumption that counts on this single verb form would approximate the distribution of the features across all forms of the verb. Additionally, as far as we were able given the preceding constraints, we selected verbs that could occur in the transitive and in the passive. 
Finally, we aimed for a frequency cut-off of 10 occurrences or more for each verb, although for unergatives we had to use one verb (jogged) that only occurred 8 times in order to have 20 verbs that satisfied the other criteria above.</Paragraph> <Paragraph position="7"> In performing this kind of corpus analysis, one has to recognize the fact that current corpus annotations do not distinguish verb senses. In these counts, we did not distinguish a core sense of the verb from an extended use of the verb. So, for instance, the sentence Consumer spending jumped 1.7% in February after a sharp drop the month before (WSJ 1987) is counted as an occurrence of the manner-of-motion verb jump in its intransitive form. This particular sense extension has a transitive alternant, but not a causative transitive (i.e., Consumer spending jumped the barrier .... but not Low taxes jumped consumer spending... ). Thus, while the possible subcategorizations remain the same, rates of transitivity and causativity may be different than for the literal manner-of-motion sense. This is an unavoidable result of using simple, automatic extraction methods given the current state of annotation of corpora.</Paragraph> <Paragraph position="8"> For each occurrence of each verb, we counted whether it was in a transitive or intransitive use (TRANS), in a passive or active use (PASS), in a past participle or simple past use (VBN), in a causative or non-causative use (CAUS), and with an animate subject or not (ANIM). 4 Note that, except for the VBN feature, for which we simply extract the POS tag from the corpus, all other counts are approximations to the actual linguistic behaviour of the verb, as we describe in detail below.</Paragraph> <Paragraph position="9"> 4 One additional feature was recorded--the log frequency of the verb in the 65 million word corpus--motivated by the conjecture that the frequency of a verb may help in predicting its class. In our machine learning experiments, however, this conjecture was not borne out, as the frequency feature did not improve performance. This is the case for experiments on all of the verbs, as well as for separate experiments on the group 1 verbs (which were matched across the classes for frequency) and the group 2 verbs (which were not). We therefore limit discussion here to the thematically-motivated features.</Paragraph> <Paragraph position="10"> Merlo and Stevenson Statistical Verb Classification The first three counts (TRANS, PASS, VBN) were performed on the tagged ACL/DCI corpus available from the Linguistic Data Consortium, which includes the Brown Corpus (of one million words) and years 1987-1989 of the Wall Street Journal, a combined corpus in excess of 65 million words. The counts for these features proceeded as follows: null * TRANS: A number, a pronoun, a determiner, an adjective, or a noun were considered to be indication of a potential object of the verb. A verb occurrence preceded by forms of the verb be, or immediately followed by a potential object was counted as transitive; otherwise, the occurrence was counted as intransitive (specifically, if the verb was followed by a punctuation sign--commas, colons, full stops--or by a conjunction, a particle, a date, or a preposition.) * PASS: A main verb (i.e., tagged VBD) was counted as active. 
A token with tag VBN was also counted as active if the closest preceding auxiliary was have, while it was counted as passive if the closest preceding auxiliary was be.</Paragraph> <Paragraph position="11"> * VBN: The counts for VBN/VBD were simply done based on the POS label within the tagged corpus.</Paragraph> <Paragraph position="12"> Each of the above three counts was normalized over all occurrences of the &quot;-ed&quot; form of the verb, yielding a single relative frequency measure for each verb for that feature; i.e., percent transitive (versus intransitive) use, percent active (versus passive) use, and percent VBN (versus VBD) use, respectively.</Paragraph> <Paragraph position="13"> The last two counts (CAUS and ANIM) were performed on a parsed version of the 1988 year of the Wall Street Journal, so that we could extract subjects and objects of the verbs more accurately. This corpus of 29 million words was provided to us by Michael Collins, and was automatically parsed with the parser described in Collins (1997). 5 The counts, and their justification, are described here: CAUS: As discussed above, the object of a causative transitive is the same semantic argument of the verb as the subject of the intransitive. The causative feature was approximated by the following steps, intended to capture the degree to which the subject of a verb can also occur as its object. Specifically, for each verb occurrence, the subject and object (if there was one) were extracted from the parsed corpus. The observed subjects across all occurrences of the verb were placed into one multiset of nouns, and the observed objects into a second multiset of nouns. (A multiset, or bag, was used so that our representation indicated the number of times each noun was used as either subject or object.) Then, the proportion of overlap between the two multisets was calculated. We define overlap as the largest multiset of elements belonging to both the 5 Readers might be concerned about the portability of this method to languages for which no large parsed corpus is available. It is possible that using a fully parsed corpus is not necessary. Our results were replicated in English without the need for a fully parsed corpus (Anoop Sarkar, p.c., citing a project report by Wootiporn Tripasai). Our method was applied to 23 million words of the WSJ that were automatically tagged with Ratnaparkhi's maximum entropy tagger (Ratnaparkhi 1996) and chunked with the partial parser CASS (Abney 1996). The results are very similar to ours (best accuracy 66.6%), suggesting that a more accurate tagger than the one used on our corpus might in fact be sufficient to overcome the fact that no full parse is available.</Paragraph> <Paragraph position="14"> subject and the object multisets; e.g., the overlap between {a, a, a, b} and {a} is {a, a, a}. The proportion is the ratio between the cardinality of the overlap multiset, and the sum of the cardinality of the subject and object multisets. For example, for the simple sets of characters above, the ratio would be 3/5, yielding a value of .60 for the CAUS feature.</Paragraph> <Paragraph position="15"> ANIM: A problem with a feature like animacy is that it requires either manual determination of the animacy of extracted subjects, or reference to an on-line resource such as WordNet for determining animacy.
To approximate animacy with a feature that can be extracted automatically, and without reference to a resource external to the corpus, we take advantage of the well-attested animacy hierarchy, according to which pronouns are the most animate (Silverstein 1976; Dixon 1994). The hypothesis is that the words I, we, you, she, he, and they most often refer to animate entities. This hypothesis was confirmed by extracting 100 occurrences of the pronoun they, which can be either animate or inanimate, from our 65 million word corpus. The occurrences immediately preceded a verb. After eliminating repetitions, 94 occurrences were left, which were classified by hand, yielding 71 animate pronouns, 11 inanimate pronouns and 12 unclassified occurrences (for lack of sufficient context to recover the antecedent of the pronoun with certainty). Thus, at least 76% of usages of they were animate; we assume the percentage of animate usages of the other pronouns to be even higher. Since the hypothesis was confirmed, we count pronouns (other than it) in subject position (Kariaeva \[1999\]; cf.</Paragraph> <Paragraph position="16"> Aone and McKee \[1996\]). The values for the feature were determined by automatically extracting all subject/verb tuples including our 59 example verbs from the parsed corpus, and computing the ratio of occurrences of pronoun subjects to all subjects for each verb.</Paragraph> <Paragraph position="17"> Finally, as indicated in Table 5, the verbs are designated as belonging to &quot;group 1&quot; or &quot;group 2&quot;. All the verbs are treated equally in our data analysis and in the machine learning experiments, but this designation does indicate a difference in details of the counting procedures described above. The verbs in group I had been used in an earlier study in which it was important to minimize noisy data (Stevenson and Merlo 1997a), so they generally underwent greater manual intervention in the counts. In adding group 2 for the classification experiment, we chose to minimize the intervention in order to demonstrate that the classification process is robust enough to withstand the resulting noise in the data.</Paragraph> <Paragraph position="18"> For group 2, the transitivity, voice, and VBN counts were done automatically without any manual intervention. For group 1, these three counts were done automatically by regular expression patterns, and then subjected to correction, partly by hand and partly automatically, by one of the authors. For transitivity, the adjustments vary for the individual verbs. Most of the reassignments from a transitive to an intransitive labelling occurred when the following noun was not the direct object but rather a measure phrase or a date. Most of the reassignments from intransitive to transitive occurred when a particle or a preposition following the verb did not introduce a prepositional phrase, but instead indicated a passive form (by) or was part of a phrasal verb. Some verbs were mostly used adjectivally, in which case they were excluded from the transitivity counts. For voice, the required adjustments included cases of coordination of the past participle when the verb was preceded by a conjunction, or a comma. These were collected and classified by hand as passive or active based on intuition. 
Similarly, partial adjustments to the VBN counts were made by hand.</Paragraph> <Paragraph position="19"> For the causativity feature, subjects and objects were determined by manual inspection of the corpus for verbs belonging to group 1, while they were extracted automatically from the parsed corpus for group 2. The group 1 verbs were sampled in three ways, depending on total frequency. For verbs with less than 150 occurrences, all instances of the verbs were used for subject/object extraction. For verbs whose total frequency was greater than 150, but whose VBD frequency was in the range 100-200, we extracted subjects and objects of the VBD occurrences only. For higher frequency verbs, we used only the first 100 VBD occurrences. 6 The same script for computing the overlap of the extracted subjects and objects was then used on the resulting subject/verb and verb/object tuples for both group 1 and group 2 verbs. The animacy feature was calculated over subject/verb tuples extracted automatically for both groups of verbs from the parsed corpus.</Paragraph> <Section position="1" start_page="384" end_page="386" type="sub_section"> <SectionTitle> 3.2 Data Analysis </SectionTitle> <Paragraph position="0"> The data collection described above yields the following data points in total: TRANS: 27403; PASS: 20481; VBN: 36297; CAUS: 11307; ANIM: 7542. (Different features yield different totals because they were sampled independently, and the search patterns to extract some features are more imprecise than others.) The aggregate means by class of the normalized frequencies for all verbs are shown in Table 6; item by item distributions are provided in Appendix A, and raw counts are available from the authors. Note that aggregate means are shown for illustration purposes only--all machine learning experiments are performed on the individual normalized frequencies for each verb, as given in Appendix A.</Paragraph> <Paragraph position="1"> The observed distributions of each feature are indeed roughly as expected according to the description in Section 2. Unergatives show a very low relative frequency of the TRANS feature, followed by unaccusatives, then object-drop verbs. Unaccusative verbs show a high frequency of the CAUS feature and a low frequency of the ANIM feature compared to the other classes. Somewhat unexpectedly, object-drop verbs exhibit a non-zero mean CAUS value (almost half the verbs have a CAUS value greater than zero), leading to a three-way causative distinction among the verb classes. We suspect that the approximation that we used for causative use--the overlap between subjects 6 For this last set of high-frequency verbs (exploded, jumped, opened, played, rushed), we used the first 100 occurrences as the simplest way to collect the sample. In response to an anonymous reviewer's concern, we later verified that these counts were not different from counts obtained by random sampling of 100 VBD occurrences. A paired t-test of the two sets of counts (first 100 sampling and random sampling) indicates that the two sets of counts are not statistically different (t = 1.283, DF = 4, p = 0.2687).</Paragraph> <Paragraph position="2"> and objects for a verb--also captures a &quot;reciprocity&quot; effect for some object-drop verbs (such as call), in which subjects and objects can be similar types of entities.
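To make the CAUS and ANIM approximations concrete, the following minimal Python sketch (not the authors' original counting scripts; the example tuples and function names are illustrative) computes the subject/object multiset-overlap ratio and the pronoun-subject ratio for a single verb. The prose definition of overlap leaves the multiplicity of shared nouns slightly open, so the sketch takes the larger of the two counts for each shared noun, which reproduces the worked example above ({a, a, a, b} against {a} gives an overlap of 3 and a ratio of 3/5 = .60).

```python
from collections import Counter

# Pronouns treated as (mostly) animate, following the animacy hierarchy; 'it' is excluded.
ANIMATE_PRONOUNS = {"i", "we", "you", "she", "he", "they"}

def caus_feature(subjects, objects):
    """Overlap ratio between the subject and object noun multisets of one verb."""
    subj_bag = Counter(w.lower() for w in subjects)
    obj_bag = Counter(w.lower() for w in objects)
    shared = set(subj_bag) & set(obj_bag)
    # One reading of "largest multiset of elements belonging to both":
    # for each shared noun, keep the larger of its two counts ({a,a,a,b} vs {a} -> 3).
    overlap = sum(max(subj_bag[w], obj_bag[w]) for w in shared)
    total = sum(subj_bag.values()) + sum(obj_bag.values())
    return overlap / total if total else 0.0

def anim_feature(subjects):
    """Proportion of subject slots filled by a personal pronoun other than 'it'."""
    if not subjects:
        return 0.0
    return sum(1 for w in subjects if w.lower() in ANIMATE_PRONOUNS) / len(subjects)

# Hypothetical extracted tuples for one verb (real values come from the parsed corpus):
subjects = ["door", "he", "company", "door"]
objects = ["door", "plant"]
print(caus_feature(subjects, objects))  # shared noun 'door': max(2, 1) = 2, so 2/6 = 0.33
print(anim_feature(subjects))           # one pronoun subject out of four, so 0.25
```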
Finally, although expected to be a redundant indicator of transitivity, PASS and VBN, unlike TRANS, have very similar values for unaccusative and object-drop verbs, indicating that their distributions are sensitive to factors we have not yet investigated.</Paragraph> <Paragraph position="3"> One issue we must address is how precisely the automatic counts reflect the actual linguistic behaviour of the verbs. That is, we must be assured that the patterns we note in the data in Table 6 are accurate reflections of the differential behaviour of the verb classes, and not an artifact of the way in which we estimate the features, or a result of inaccuracies in the counts. In order to evaluate the accuracy of our feature counts, we selected two verbs from each class, and determined the &quot;true&quot; value of each feature for each of those six verbs through manual counting. The six verbs were randomly selected from the group 2 subset of the verbs, since counts for group 2 verbs (as explained above) had not undergone manual correction. This allows us to determine the accuracy of the fully automatic counting procedures. The selected verbs (and their frequencies) are: hopped (29), scurried (21), folded (189), stabilized (286), inherited (357), swallowed (152). For verbs that had a frequency of over 100 in the &quot;-ed&quot; form, we performed the manual counts on the first 100 occurrences.</Paragraph> <Paragraph position="4"> Table 7 shows the results of the manual counts, reported as proportions to facilitate comparison to the normalized automatic counts, shown in adjoining columns.</Paragraph> <Paragraph position="5"> We observe first that, overall, most errors in the automatic counts occur in the unaccusative and object-drop verbs. While tagging errors affect the VBN feature for all of the verbs somewhat, we note that TRANS and PASS are consistently underestimated for unaccusative and object-drop verbs. These errors make the unaccusative and object-drop feature values more similar to each other, and therefore potentially harder to distinguish. Furthermore, because the TRANS and PASS values are underestimated by the automatic counts, and therefore lower in value, they are also closer to the values for the unergative verbs. For the CAUS feature, we predict the highest values for the unaccusative verbs, and while that prediction is confirmed, the automatic counts for that class also show the most errors. Finally, although the general pattern of higher values for the ANIM feature of unergatives and object-drop verbs is preserved in the automatic counts, the feature is underestimated for almost all the verbs, again making the values for that feature closer across the classes than they are in reality.</Paragraph> <Paragraph position="6"> We conclude that, although there are inaccuracies in all the counts, the general patterns expected based on our analysis of the verb classes hold in both the manual and automatic counts. Errors in the estimating and counting procedures are therefore not likely to be responsible for the pattern of data in Table 6 above, which generally matches our predictions.
Furthermore, the errors, at least for this random sample of verbs, occur in a direction that makes our task of distinguishing the classes more difficult, and indicates that developing more accurate search patterns may possibly sharpen the class distinctions, and improve the classification performance.</Paragraph> </Section> </Section> <Section position="6" start_page="386" end_page="394" type="metho"> <SectionTitle> 4. Experiments in Classification </SectionTitle> <Paragraph position="0"> In this section, we turn to our computational experiments that investigate whether the statistical indicators of thematic properties that we have developed can in fact be used to classify verbs. Recall that the task we have set ourselves is that of automatically learning the best class for a set of usages of a verb, as opposed to classifying individual occurrences of the verb. The frequency distributions of our features yield a vector for each verb that represents the estimated values for the verb on each dimension across the entire corpus: Vector template: \[verb-name, TRANS, PASS, VBN, CAUS, ANIM, class\] Example: \[opened, .69, .09, .21, .16, .36, unacc\] The resulting set of 59 vectors constitutes the data for our machine learning experiments. We use this data to train an automatic classifier to determine, given the feature values for a new verb (not from the training set), which of the three major classes of English optionally intransitive verbs it belongs to.</Paragraph> <Section position="1" start_page="386" end_page="387" type="sub_section"> <SectionTitle> 4.1 Experimental Methodology </SectionTitle> <Paragraph position="0"> In pilot experiments on a subset of the features, we investigated a number of supervised machine learning methods that produce automatic classifiers (decision tree induction, rule learning, and two types of neural networks), as well as hierarchical clustering; see Stevenson et al. (1999) for more detail. Because we achieved approximately the same level of performance in all cases, we narrowed our further experimentation to the publicly available version of the C5.0 machine learning system (http://www.rulequest.com), a newer version of C4.5 (Quinlan 1992), due to its ease of use and wide availability. The C5.0 system generates both decision trees and corresponding rule sets from a training set of known classifications. In our experiments, we found little to no difference in performance between the trees and rule sets, and report only the rule set results.</Paragraph> <Paragraph position="1"> In the experiments below, we follow two methodologies in training and testing, each of which tests a subset of cases held out from the training data. Thus, in all cases, the results we report are on test data that was never seen in training. 7 The first training and testing methodology we follow is 10-fold cross-validation. In this approach, the system randomly divides the data into ten parts, and runs ten times on a different 90%-training-data/10%-test-data split, yielding an average accuracy and standard error across the ten test sets. This training methodology is very useful for 7 One anonymous reviewer raised the concern that we do not test on verbs that were unseen by the authors prior to finalizing the specific features to count. However, this does not reduce the generality of our results. The features we use are motivated by linguistic theory, and derived from the set of thematic properties that discriminate the verb classes. 
It is therefore very unlikely that they are skewed to the particular verbs we have chosen. Furthermore, our cross-validation experiments, described in the next subsection, show that our results hold across a very large number of randomly selected subsets of this sample of verbs. Computational Linguistics Volume 27, Number 3 our application, as it yields performance measures across a large number of training data/test data sets, avoiding the problems of outliers in a single random selection from a relatively small data set such as ours.</Paragraph> <Paragraph position="2"> The second methodology is a single hold-out training and testing approach. Here, the system is run N times, where N is the size of the data set (i.e., the 59 verbs in our case), each time holding out a single data vector as the test case and using the remaining N-1 vectors as the training set. The single hold-out methodology yields an overall accuracy rate (when the results are averaged across all N trials), but also-unlike cross-validation--gives us classification results on each individual data vector. This property enables us to analyze differential performance on the individual verbs and across the different verb classes.</Paragraph> <Paragraph position="3"> Under both training and testing methodologies, the baseline (chance) performance in this task--a three-way classification--is 33.9%. In the single hold-out methodology, there are 59 test cases, with 20, 19, and 20 verbs each from the unergative, unaccusative, and object-drop classes, respectively. Chance performance of picking a single class label as a default and assigning it to all cases would yield at most 20 out of the 59 cases correct, or 33.9%. For the cross-validation methodology, the determination of a baseline is slightly more complex, as we are testing on a random selection of 10% of the full data set in each run. The 33.9% figure represents the expected relative proportion of a test set that would be labelled correctly by assignment of a default class label to the entire test set. Although the precise make-up of the test cases vary, on average the test set will represent the class membership proportions of the entire set of verbs. Thus, as with the single hold-out approach, chance accuracy corresponds to a maximum of 20/59, or 33.9%, of the test set being labelled correctly.</Paragraph> <Paragraph position="4"> The theoretical maximum accuracy for the task is, of course, 100%, although in Section 5 we discuss some classification results from human experts that indicate that a more realistic expectation is much lower (around 87%).</Paragraph> </Section> <Section position="2" start_page="387" end_page="389" type="sub_section"> <SectionTitle> 4.2 Results Using 10-Fold Cross-Validation </SectionTitle> <Paragraph position="0"> We first report the results of experiments using a training methodology of 10-fold cross-validation repeated 50 times. This means that the 10-fold cross-validation procedure is repeated for 50 different random divisions of the data. The numbers reported are the averages of the results over all the trials. That is, the average accuracy and standard error from each random division of the data (a single cross-validation run including 10 training and test sets) are averaged across the 50 different random divisions. 
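As a rough illustration of this training regime (a minimal stand-in, not the C5.0 setup used in the paper), the sketch below runs 50 repetitions of stratified 10-fold cross-validation over the verb vectors, with a scikit-learn decision tree in place of the C5.0 rule learner; the feature order, array construction, and helper names are assumptions, and the vectors themselves are those listed in Appendix A.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["TRANS", "PASS", "VBN", "CAUS", "ANIM"]

def repeated_cv_accuracy(X, y, repeats=50, folds=10):
    """Mean accuracy and standard error over `repeats` random 10-fold divisions.

    X: (n_verbs, 5) array of normalized feature frequencies in FEATURES order.
    y: class labels ('unerg', 'unacc', 'objdrop').
    A decision tree stands in for the C5.0 rule sets used in the paper.
    """
    run_means = []
    for seed in range(repeats):
        # One random division of the data into 10 train/test splits.
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        scores = cross_val_score(DecisionTreeClassifier(random_state=seed), X, y, cv=cv)
        run_means.append(scores.mean())          # average accuracy over the 10 test folds
    run_means = np.array(run_means)
    return run_means.mean(), run_means.std(ddof=1) / np.sqrt(repeats)

# Example call; the 59 five-feature vectors are given in Appendix A, e.g. 'opened':
# X = np.array([[0.69, 0.09, 0.21, 0.16, 0.36], ...])
# y = np.array(["unacc", ...])
# mean_acc, stderr = repeated_cv_accuracy(X, y)
```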
This large number of experimental trials gives us a very tight bound on the mean accuracy reported, enabling us to determine with high confidence the statistical significance of differences in results.</Paragraph> <Paragraph position="1"> Table 8 shows that performance of classification using individual features varies greatly, from little above the baseline to almost 22% above the baseline, or a reduction of a third of the error rate, a very good result for a single feature. (All reported accuracies in Table 8 are statistically distinct, at the p < .01 level, using an ANOVA [df = 249, F = 334.72], with a Tukey-Kramer post test.) The first line of Table 9 shows that the combination of all features achieves an accuracy of 69.8%, which is 35.9% over the baseline, for a reduction in the error rate of 54%. This is a rather considerable result, given the very low baseline (33.9%). Moreover, recall that our training and testing sets are always disjoint (cf. Lapata and Brew [1999]; Siegel [1999]); in other words, we are predicting the classification of verbs that were never seen in the training corpus, the hardest situation for a classification algorithm.</Paragraph> <Paragraph position="2"> The second through sixth lines of Table 9 show the accuracy achieved on each subset of features that results from removing a single feature. The rows of Table 9 are: 1. all five features (TRANS PASS VBN CAUS ANIM), accuracy 69.8 (SE .5); 2. without PASS, 69.8 (.5); 3. without CAUS, 67.3 (.6); 4. without VBN, 66.5 (.5); 5. without ANIM, 63.2 (.6); 6. without TRANS, 61.6 (.6). This allows us to evaluate the contribution of each feature to the performance of the classification process, by comparing the performance of the subset without it, to the performance using the full set of features. We see that the removal of PASS (second line) has no effect on the results, while removal of the remaining features yields a 2-8% decrease in performance. (In Table 9, the differences between all reported accuracies are statistically significant, at the p < .05 level, except for between lines 1 and 2, lines 3 and 4, and lines 5 and 6, using an ANOVA [df = 299, F = 37.52], with a Tukey-Kramer post test.) We observe that the behavior of the features in combination cannot be predicted by the individual feature behavior. For example, CAUS, which is the best individually, does not greatly affect accuracy when combined with the other features (compare line 3 to line 1). Conversely, ANIM and TRANS, which do not classify verbs accurately when used alone, are the most relevant in a combination of features (compare lines 5 and 6 to line 1). We conclude that experimentation with combinations of features is required to determine the relevance of individual features to the classification task.</Paragraph> <Paragraph position="3"> The general behaviour in classification based on individual features and on size 4 and size 5 subsets of features is confirmed for all subsets. Appendix B reports the results for all subsets of feature combinations, in order of decreasing performance. Table 10 summarizes this information. In the first data column, the table illustrates the average accuracy across all subsets of each size. The second through sixth data columns report the average accuracy of all the size n subsets in which each feature occurs. For example, the second data cell in the second row (54.9) indicates the average accuracy of all subsets of size 2 that contain the feature VBN.
The last row of the table indicates the average accuracy for each feature of all subsets containing that feature.</Paragraph> <Paragraph position="4"> in all subsets. Looking at the first data column of Table 10, we can observe that, on average, larger sets of features perform better than smaller sets. Furthermore, as can be seen in the following individual feature columns, individual features perform better in a bigger set than in a smaller set, without exception. The second observation--that the performance of individual features is not always a predictor of their performance in combination--is confirmed by comparing the average performance of each feature in subsets of different sizes to the average across all subsets of each size. We can observe, for instance, that the feature CAUS, which performs very well alone, is average in feature combinations of size 3 or 4. By contrast, the feature ANIM, which is the worst if used alone, is very effective in combination, with above average performance for all subsets of size 2 or greater.</Paragraph> </Section> <Section position="3" start_page="389" end_page="391" type="sub_section"> <SectionTitle> 4.3 Results Using Single Hold-Out Methodology </SectionTitle> <Paragraph position="0"> One of the disadvantages of the cross-validation training methodology, which averages performance across a large number of random test sets, is that we do not have performance data for each verb, nor for each class of verbs. In another set of experiments, we used the same C5.0 system, but employed a single hold-out training and testing methodology. In this approach, we hold out a single verb vector as the test case, and train the system on the remaining 58 cases. We then test the resulting classifier on the single hold-out case, and record the assigned class for that verb. This procedure is repeated for each of the 59 verbs. As noted above, the single hold-out methodology has the benefit of yielding both classification results on each individual verb, and an overall accuracy rate (the average results across all 59 trials). Moreover, the results on individual verbs provide the data necessary for determining accuracy for each verb class. This allows us to determine the contribution of individual features as above, but with reference to their effect on the performance of individual classes. This is important, as it enables us to evaluate our hypotheses concerning the relation between the thematic features and verb class distinctions, which we turn to in Section 4.4.</Paragraph> <Paragraph position="1"> We performed single hold-out experiments on the full set of features, as well as on each subset of features with a single feature removed. The first line of Table 11 shows that the overall accuracy for all five features is almost exactly the same as that achieved with the 10-fold cross-validation methodology (69.5% versus 69.8%). As with the cross-validation results, the removal of PASS does not degrade performance--in fact, here its removal appears to improve performance (see line 2 of Table 11). However, it should be noted that this increase in performance results from one additional verb being 1. TRANS PASS VBN CAUS ANIM 69.5 2. TRANS VBN CAUS ANIM PASS 71.2 3. TRANS PASS VBN ANIM CAUS 62.7 4. TRANS PASS CAUS AN1M VBN 61.0 5. TRANS PASS VBN CAUS ANIM 61.0 6. PASS VBN CAUS ANIM TRANS 64.4 1. TRANS PASS VBN CAUS ANIM 73.9 68.6 2. TRANS VBN CAUS ANIM PASS 76.2 75.7 3. TRANS PASS VBN ANIM CAUS 65.1 60.0 4. TRANS PASS CAUS ANIM VBN 66.7 65.0 5. 
TRANS PASS VBN CAUS ANIM 72.7 47.0 6. PASS VBN CAUS ANIM TRANS 78.1 51.5 classified correctly. The remaining lines of Table 11 show that the removal of any other feature has a 5-8% negative effect on performance, again similar to the cross-validation results. (Although note that the precise accuracy achieved is not the same in each case as with 10-fold cross-validation, indicating that there is some sensitivity to the precise make-up of the training set when using a subset of the features.) Table 12 presents the results of the single hold-out experiments in terms of performance within each class, using an F measure with balanced precision and recall. 8 The first line of the table shows clearly that, using all five features, the unergatives are classified with greater accuracy (F = 73.9%) than the unaccusative and object-drop verbs (F scores of 68.6% and 64.9%, respectively). The features appear to be better at distinguishing unergatives than the other two verb classes. The remaining lines of Table 12 show that this pattern holds for all of the subsets of features as well. Clearly, future work on our verb classification task will need to focus on determining features that better discriminate unaccusative and object-drop verbs.</Paragraph> <Paragraph position="2"> One potential explanation that we can exclude is that the pattern of results is due simply to the frequencies of the verbs--that is, that more frequent verbs are more accurately classified. We examined the relation between classification accuracy and log 8 For all previous results, we reported an accuracy measure (the percentage of correct classifications out of all classifications). Using the terminology of true or false positives/negatives, this is the same as truePositives/(truePositives + falseNegatives). In the earlier results, there are no falsePositives or trueNegatives, since we are only considering for each verb whether it is correctly classified (truePositive) or not (falseNegative). However, when we turn to analyzing the data for each class, the possibility arises of having falsePositives and trueNegatives for that class. Hence, here we use the balanced F score, which calculates an overall measure of performance as 2PR/(P + R), in which P (precision) is truePositives/(truePositives + falsePositives), and R (recall) is truePositives/(truePositives + falseNegatives).</Paragraph> <Paragraph position="3"> frequencies of the verbs, both by class and individually. By class, unergatives have the lowest average log frequency (1.8), but are the best classified, while unaccusatives and object-drops are comparable (average log frequency = 2.4). If we group individual verbs by frequency, the proportion of errors to the total number of verbs is not linearly related to frequency (log frequency < 2: 7 errors/24 verbs, or 29% error; log frequency between 2 and 3: 7 errors/25 verbs, or 28% error; log frequency > 3: 4 errors/10 verbs, or 40% error). Moreover, it seems that the highest-frequency verbs pose the most problems to the program. In addition, the only verb of log frequency < 1 is correctly classified, while the only one with log frequency > 4 is not. In conclusion, we do not find that there is a simple mapping from frequency to accuracy.
In particular, it is not the case that more frequent classes or verbs are more accurately classified.</Paragraph> <Paragraph position="4"> One factor possibly contributing to the poorer performance on unaccusatives and object-drops is the greater degree of error in the automatic counting procedures for these verbs, which we discussed in Section 3.2. In addition to exploration of other linguistic features, another area of future work is to develop better search patterns, for transitivity and passive in particular. Unfortunately, one limiting factor in automatic counting is that we inherit the inevitable errors in POS tags in an automatically tagged corpus. For example, while the unergative verbs are classified highly accurately, we note that two of the three errors in misclassifying unergatives (galloped and paraded) are due to a high degree of error in tagging. 9 The verb galloped is incorrectly tagged VBN instead of VBD in all 12 of its uses in the corpus, and the verb paraded is incorrectly tagged VBN instead of VBD in 13 of its 33 uses in the corpus. After correcting only the VBN feature of these two verbs to reflect the actual part of speech, overall accuracy in classification increases by almost 10%, illustrating the importance of both accurate counts and accurate annotation of the corpora.</Paragraph> </Section> <Section position="4" start_page="391" end_page="394" type="sub_section"> <SectionTitle> 4.4 Contribution of the Features to Classification </SectionTitle> <Paragraph position="0"> We can further use the single hold-out results to determine the contribution of each feature to accuracy within each class. We do this by comparing the class labels assigned using the full set of five features (TRANS, PASS, VBN, CAUS, ANIM) with the class labels assigned using each size 4 subset of features. The difference in classifications between each four-feature subset and the full set of features indicates the changes in class labels that we can attribute to the added feature in going from the four-feature to five-feature set. Thus, we can see whether the features indeed contribute to discriminating the classes in the manner predicted in Section 2.2, and summarized here in Table 13.</Paragraph> <Paragraph position="1"> We illustrate the data with a set of confusion matrices, in Tables 14 and 15, which show the pattern of errors according to class label for each set of features. In each confusion matrix, the rows indicate the actual class of incorrectly classified verbs, and the columns indicate the assigned class. For example, the first row of the first panel of Table 14 shows that one unergative was incorrectly labelled as unaccusative, and two unergatives as object-drop. To determine the confusability of any two classes (the 9 The third error in classification of unergatives is the verb floated, which we conjecture is due not to counting errors, but to the linguistic properties of the verb itself. The verb is unusual for a manner-of-motion verb in that the action is inherently &quot;uncontrolled&quot;, and thus the subject of the intransitive/object of the transitive is a more passive entity than with the other unergatives (perhaps indicating that the inventory of thematic roles should be refined to distinguish activity verbs with less agentive subjects). We think that this property relates to the notion of internal and external causation that is an important factor in distinguishing unergative and unaccusative verbs. 
We refer the interested reader to Stevenson and Merlo (1997b), which discusses the latter issue in more detail. opposite of discriminability), we look at two cells in the matrix: the one in which verbs of the first class were assigned the label of the second class, and the one in which verbs of the second class were assigned the label of the first class. (These pairs of cells are those opposite the diagonal of the confusion matrix.) By examining the decrease (or increase) in confusability of each pair of classes in going from a four-feature experiment to the five-feature experiment, we gain insight into how well (or how poorly) the added feature helps to discriminate each pair of classes.</Paragraph> <Paragraph position="2"> An analysis of the confusion matrices reveals that the behavior of the features largely conforms to our linguistic predictions, leading us to conclude that the features we counted worked largely for the reasons we had hypothesized. We expected CAUS and ANIM to be particularly helpful in identifying unaccusatives, and these predictions are confirmed. Compare the second to the first panel of Table 14 (the errors without the CAUS feature compared to the errors with the CAUS feature added to the set).</Paragraph> <Paragraph position="3"> We see that, without the CAUS feature, the confusability between unaccusatives and unergatives, and between unaccusatives and object-drops, is 9 and 7 errors, respectively; but when CAUS is added to the set of features, the confusability between these pairs of classes drops substantially, to 5 and 6 errors, respectively. On the other hand, the confusability between unergatives and object-drops becomes slightly worse (errors increasing from 6 to 7). The latter indicates that the improvement in unaccusatives is not simply due to an across-the-board improvement in accuracy as a result of having more features. We see a similar pattern with the ANIM feature. Comparing the third to the first panel of Table 14 (the errors without the ANIM feature compared to the errors with the ANIM feature added to the set), we see an even larger improvement in discriminability of unaccusatives when the ANIM feature is added. The confusability of unaccusatives and unergatives drops from 7 errors to 5 errors, and of unaccusatives and object-drops from 11 errors to 6 errors. Again, confusability of unergatives and object-drops is worse, with an increase in errors from 5 to 7.</Paragraph> <Paragraph position="4"> We had predicted that the TRANS feature would make a three-way distinction among the verb classes, based on its predicted linear relationship between the classes (see the inequalities in Table 13). We had further expected that PASS and VBN would behave similarly, since these features are correlated to TRANS. To make a three-way distinction among the verb classes, we would expect confusability between all three pairs of verb classes to decrease (i.e., discriminability would improve) with the addition of TRANS, PASS, or VBN. We find that these predictions are confirmed in part.</Paragraph> <Paragraph position="6"> First consider the TRANS feature. Comparing the second to the first panel of Table 15, we find that unergatives are already accurately classified, and the addition of TRANS to the set does indeed greatly reduce the confusability of unaccusatives and object-drops, with the number of errors dropping from 12 to 6.
However, we also observe that the confusability of unergatives and unaccusatives is not improved, and the confusability of unergatives and object-drops is worsened with the addition of the TRANS feature, with errors in the latter case increasing from 4 to 7. We conclude that the expected three-way discriminability of TRANS is most apparent in the reduced confusion of unaccusative and object-drop verbs.</Paragraph> <Paragraph position="6"> Our initial prediction was that PASS and VBN would behave similarly to TRANS--that is, also making a three-way distinction among the classes--although the aggregate data revealed little difference in these feature values between unaccusatives and object-drops. Comparing the third to the first panel of Table 15, we observe that the addition of the PASS feature hinders the discriminability of unergatives and unaccusatives (increasing errors from 2 to 5); it does help in discriminating the other pairs of classes, but only slightly (reducing the number of errors by 1 in each case). The VBN feature shows a similar pattern, but is much more helpful at distinguishing unergatives from object-drops, and object-drops from unaccusatives. In comparing the fourth to the first panel of Table 15, we find that the confusability of unergatives and object-drops is reduced from 9 errors to 7, and of unaccusatives and object-drops from 10 errors to 6. The latter result is somewhat surprising, since the aggregate VBN data for the unaccusative and object-drop classes are virtually identical. We conclude that the contribution of a feature to classification is not predictable from the apparent discriminability of its numeric values across the classes. This observation emphasizes the importance of an experimental method to evaluating our approach to verb classification.</Paragraph> </Section> </Section> <Section position="7" start_page="394" end_page="394" type="metho"> <SectionTitle> 5. Establishing the Upper Bound for the Task </SectionTitle> <Paragraph position="0"> In order to evaluate the performance of the algorithm in practice, we need to compare it to the accuracy of classification performed by an expert, which gives a realistic upper bound for the task. The lively theoretical debate on class membership of verbs, and the complex nature of the linguistic information necessary to accomplish this task, led us to believe that the task is difficult and not likely to be performed at 100% accuracy even by experts, and is also likely to show differences in classification between experts.</Paragraph> <Paragraph position="1"> We report here the results of two experiments which measure expert accuracy in classifying our verbs (compared to Levin's classification as the gold standard), as well as inter-expert agreement. (See also Merlo and Stevenson [2000a] for more details.) To enable comparison of responses, we performed a closed-form questionnaire study, where the number and types of the target classes are defined in advance, for which we prepared a forced-choice and a non-forced-choice variant. The forced-choice study provides data for a maximally restricted experimental situation, which corresponds most closely to the automatic verb classification task.
However, we are also interested in slightly more natural results--provided by the non-forced-choice task--where the experts can assign the verbs to an &quot;others&quot; category.</Paragraph> <Paragraph position="2"> We asked three experts in lexical semantics (all native speakers of English) to complete the forced-choice electronic questionnaire study. Neither author was among the three experts, who were all professionals in computational or theoretical linguistics with a specialty in lexical semantics. Materials consisted of individually randomized lists of the same 59 verbs used for the machine learning experiments, using Levin's (1993) electronic index, available from Chicago University Press. The verbs were to be classified into the three target classes--unergative, unaccusative, and object-drop--which were described in the instructions. 10 (All materials and instructions are available at URL http://www.latl.unige.ch/latl/personal/paola.html.) Table 16 shows an analysis of the results, reporting both percent agreement and pairwise agreement (according to the Kappa statistic) among the experts and the program. 11 Assessing the percentage of verbs on which the experts agree gives us</Paragraph> </Section> <Section position="8" start_page="394" end_page="397" type="metho"> <SectionTitle> 10 The definitions of the classes were as follows. Unergative: A verb that assigns an agent theta role to the </SectionTitle> <Paragraph position="0"> subject in the intransitive. If it is able to occur transitively, it can have a causative meaning.</Paragraph> <Paragraph position="1"> Unaccusative: A verb that assigns a patient/theme theta role to the subject in the intransitive. When it occurs transitively, it has a causative meaning. Object-Drop: A verb that assigns an agent role to the subject and patient/theme role to the object, which is optional. When it occurs transitively, it does not have a causative meaning. 11 In the comparison of the program to the experts, we use the results of the classifier under single hold-out training--which yields an accuracy of 69.5%--because those results provide the classification for each of the individual verbs.</Paragraph> <Paragraph position="2"> an intuitive measure. However, this measure does not take into account how much the experts agree over the expected agreement by chance. The latter is provided by the Kappa statistic, which we calculated following Klauer (1987, 55-57) (using the z distribution to determine significance; p < 0.001 for all reported results). The Kappa value measures the experts', and our classifier's, degree of agreement over chance, with the gold standard and with each other. Expected chance agreement varies with the number and the relative proportions of categories used by the experts. This means that two given pairs of experts might reach the same percent agreement on a given task, but not have the same expected chance agreement, if they assigned verbs to classes in different proportions. The Kappa statistic ranges from 0, for no agreement above chance, to 1, for perfect agreement. The interpretation of the scale of agreement depends on the domain, like all correlations. Carletta (1996) cites the convention from the domain of content analysis indicating that .67 < K < .8 indicates marginal agreement, while K > .8 is an indication of good agreement.
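For concreteness, a pairwise agreement-above-chance figure of this kind can be computed as in the short sketch below. This is a generic two-rater (Cohen-style) kappa, shown only to make the quantity concrete; the paper follows Klauer's (1987) formulation, which may differ in detail, and the label lists are hypothetical.

```python
from collections import Counter

def pairwise_kappa(labels_a, labels_b):
    """Agreement above chance between two label sequences over the same verbs."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement depends on how often each rater uses each class.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Two (hypothetical) experts labelling four verbs:
print(pairwise_kappa(["unerg", "unacc", "objdrop", "unacc"],
                     ["unerg", "objdrop", "objdrop", "unacc"]))  # approx. 0.64
```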
We can observe that only one of our agreement figures comes close to reaching what would be considered &quot;good&quot; under Carletta's interpretation. Given the very high level of expertise of our human experts, we suspect then that this is too stringent a scale for our task, which is qualitatively quite different from content analysis.</Paragraph> <Paragraph position="3"> Evaluating the experts' performance summarized in Table 16, we can make two observations, which confirm our expectations. First, the task is difficult--i.e., not performed at 100% (or close) even by trained experts, when compared to the gold standard, with the highest percent agreement with Levin at 86.5%. Second, with respect to comparison of the experts among themselves, the rate of agreement is never very high, and the variability in agreement is considerable, ranging from .53 to .66. This evaluation is also supported by a 3-way agreement measure (Siegel and Castellan 1988). Applying this calculation, we find that the percentage of verbs to which the three experts gave the same classification (60%, K = 0.6) is smaller than any of the pairwise agreements, indicating that the experts do not all agree on the same subset of verbs.</Paragraph> <Paragraph position="4"> The observation that the experts often disagree on this difficult task suggests that a combination of expert judgments might increase the upper bound. We tried the simplest combination, by creating a new classification using a majority vote: each verb was assigned the label given by at least two experts. Only three cases did not have any majority label; in these cases we used the classification of the most accurate expert. This new classification does not improve the upper bound, reaching only 86.4% (K = .80) compared to the gold standard.</Paragraph> <Paragraph position="5"> The evaluation is also informative with respect to the performance of the program.</Paragraph> <Paragraph position="6"> On the one hand, we observe that if we take the best performance achieved by an expert in this task--86.5%--as the maximum achievable accuracy in classification, our algorithm then reduces the error rate over chance by approximately 68%, a very respectable result. In fact, the accuracy of 69.5% achieved by the program is only 1.5% less than one of the human experts in comparison to the gold standard. On the other hand, the algorithm still does not perform at expert level, as indicated by the fact that, for all experts, the lowest agreement score is with the program.</Paragraph> <Paragraph position="7"> One interesting question is whether experts and program disagree on the same verbs, and show similar patterns of errors. The program makes 18 errors, in total, compared to the gold standard. However, in 9 cases, at least one expert agrees with the classification given by the program. The program makes fewer errors on unergatives (3) and comparably many on unaccusatives and object-drops (7 and 8 respectively), indicating that members of the latter two classes are quite difficult to classify. This differs from the pattern of average agreement between the experts and Levin, who agree on 17.7 (of 20) unergatives, 16.7 (of 19) unaccusatives, and 11.3 (of 20) object-drops. This clearly indicates that the object-drop class is the most difficult for the human experts to define.
The object-drop class is the most heterogeneous in our verb list, consisting of verbs from several subclasses of the &quot;unexpressed object alternation&quot; class in Levin (1993). We conclude that the verb classification task is likely easier for very homogeneous classes, and more difficult for more broadly defined classes, even when the exemplars share the critical syntactic behaviors.</Paragraph> <Paragraph position="8"> On the other hand, frequency does not appear to be a simple factor in explaining patterns of agreement between experts, or increases in accuracy. As in Section 4.3, we again analyze the relation between the log frequency of the verbs and classification performance, here considering the performance of the experts. We grouped verbs into three log frequency classes: verbs with log frequency less than 2 (i.e., frequency less than 100), those with log frequency between 2 and 3 (i.e., frequency between 100 and 1000), and those with log frequency over 3 (i.e., frequency over 1000). The low-frequency group had 24 verbs (14 unergatives, 5 unaccusatives, and 5 object-drops), the intermediate-frequency group had 25 verbs (5 unergatives, 9 unaccusatives, and 11 object-drops), and the high-frequency group had 10 verbs (1 unergative, 5 unaccusatives, and 4 object-drops). We found that verbs with high and low frequency yield better accuracy and agreement among the experts than verbs with mid frequency.</Paragraph> <Paragraph position="9"> Neither the accuracy of the majority classification nor the accuracy of the expert with the best agreement with Levin was linearly affected by frequency. For the majority vote, verbs with frequency less than 100 yield an accuracy of 92%, K = .84; verbs with frequency between 100 and 1000, accuracy 80%, K = .69; and verbs with frequency over 1000, accuracy 90%, K = .82. For the &quot;best&quot; expert, the pattern is similar: verbs with frequency less than 100 yield an accuracy of 87.5%, K = .74; verbs with frequency between 100 and 1000, accuracy 84%, K = .76; and verbs with frequency over 1000, accuracy 90%, K = .82.</Paragraph> <Paragraph position="10"> We can see here that different frequency groups yield different classification behavior. However, the relation is not simple, and it is clearly affected by the composition of the frequency groups: the middle group contains mostly unaccusative and object-drop verbs, which are the verbs with which our experts have the most difficulty. This confirms that the class of the verb is the predominant factor in their pattern of errors. Note also that the pattern of accuracy across frequency groupings is not the same as that of the program (see Section 4.3, which revealed the most errors by the program on the highest-frequency verbs), again indicating qualitative differences in performance between the program and the experts.</Paragraph> <Paragraph position="11"> Finally, one possible shortcoming of the above analysis is that the forced-choice task, while maximally comparable to our computational experiments, may not be a natural one for human experts. To explore this issue, we asked two different experts in lexical semantics (one a native speaker of English and one bilingual) to complete the non-forced-choice electronic questionnaire study; again, neither author served as one of the experts. In this task, in addition to the three verb classes of interest, an answer of &quot;other&quot; was allowed.
Materials consisted of individually randomized lists of 119 target and filler verbs taken from Levin's (1993) electronic index, as above. The targets were again the same 59 verbs used for the machine learning experiments. To avoid unwanted priming of target items, the 60 fillers were automatically selected from the set of verbs that do not share any class with any of the senses of the 59 target verbs in Levin's index. In this task, if we take only the target items into account, the experts agreed 74.6% of the time (K = 0.64) with each other, and 86% (K = 0.80) and 69% (K = 0.57) with the gold standard. (If we take all the verbs into consideration, they agreed in 67% of the cases [K = 0.56] with each other, and 68% [K = 0.55] and 60.5% [K = 0.46] with the gold standard, respectively.) These results show that the forced-choice and non-forced-choice tasks are comparable in accuracy of classification and inter-judge agreement on the target classes, giving us confidence that the forced-choice results provide a reasonably stable upper bound for computational experiments.</Paragraph> </Section> <Section position="9" start_page="397" end_page="399" type="metho"> <SectionTitle> 6. Discussion </SectionTitle> <Paragraph position="0"> The work presented here contributes to some central issues in computational linguistics by providing novel insights, data, and methodology in some cases, and by reinforcing some previously established results in others. Our research stems from three main hypotheses:</Paragraph> <Paragraph position="3"> 1. Argument structure is the relevant level of representation for verb classification.</Paragraph> <Paragraph position="4"> 2. Argument structure is manifested distributionally in syntactic alternations, giving rise to differences in subcategorization frames or the distributions of their usage, or in the properties of the NP arguments to a verb.</Paragraph> <Paragraph position="5"> 3. This information is detectable in a corpus and can be learned automatically.</Paragraph> <Paragraph position="6"> We discuss the relevant debate on each of these hypotheses, and the contribution of our results to each, in the following subsections.</Paragraph> <Section position="1" start_page="397" end_page="397" type="sub_section"> <SectionTitle> 6.1 Argument Structure and Verb Classification </SectionTitle> <Paragraph position="0"> Argument structure has previously been recognized as one of the most promising candidates for accurate classification. For example, Basili, Pazienza, and Velardi (1996) argue that relational properties of verbs--their argument structure--are more informative for classification than their definitional properties (e.g., the fact that a verb describes a manner of motion or a way of cooking). Their arguments rest on linguistic and psycholinguistic results on classification and language acquisition (in particular, Pinker 1989; Rosch 1978).</Paragraph> <Paragraph position="1"> Our results confirm the primary role of argument structure in verb classification.</Paragraph> <Paragraph position="2"> Our experimental focus is particularly clear in this regard because we deal with verbs that are &quot;minimal pairs&quot; with respect to argument structure.
By classifying verbs that show the same subcategorizations (transitive and intransitive) into different classes, we are able to eliminate one of the confounds in classification work created by the fact that subcategorization and argument structure are often co-variant. We can infer that the accuracy of our classification is due to argument structure information, as subcategorization is the same for all verbs. Thus, we observe that the content of the thematic roles assigned by a verb is crucial for classification.</Paragraph> </Section> <Section position="2" start_page="397" end_page="398" type="sub_section"> <SectionTitle> 6.2 Argument Structure and Distributional Statistics </SectionTitle> <Paragraph position="0"> Our results further support the assumption that thematic differences across verb classes are apparent not only in differences in subcategorization frames, but also in differences in their frequencies. This connection relies heavily on the hypothesis that lexical semantics and lexical syntax are correlated, following Levin (1985; 1993). However, this position has been challenged by Basili, Pazienza, and Velardi (1996) and Boguraev and Briscoe (1989), among others. For example, in an attempt to assess the actual completeness and usefulness of the Longman Dictionary of Contemporary English (LDOCE) entries, Boguraev and Briscoe (1989) found that people assigned a &quot;change of possession&quot; meaning both to verbs that had dative-related subcategorization frames (as indicated in the LDOCE) and to verbs that did not. Conversely, they also found that both verbs that have a change-of-possession component in their meaning and those that do not could have a dative code. They conclude that the thesis put forth by Levin (1985) is only partially supported. Basili, Pazienza, and Velardi (1996) show further isolated examples meant to illustrate that lexical syntax and semantics are not in a one-to-one relation.</Paragraph> <Paragraph position="1"> Many recent results, however, seem to converge in supporting the view that the relation between lexical syntax and semantics can be usefully exploited (Aone and McKee 1996; Dorr 1997; Dorr, Garman, and Weinberg 1995; Dorr and Jones 1996; Lapata and Brew 1999; Schulte im Walde 2000; Siegel 1998; Siegel 1999). Our work in particular underscores the relation between the syntactic manifestations of argument structure and lexical semantic class. In light of these recent successes, the conclusions in Boguraev and Briscoe (1989) are clearly too pessimistic. In fact, their results do not contradict the more recent ones. First of all, it is not the case that if an implication holds from argument structure to subcategorization (change of possession implies dative shift), the converse also holds. It comes as no surprise that verbs that do not have any change-of-possession component in their meaning may also show dative shift syntactically. Secondly, as Boguraev and Briscoe themselves note, Levin's statement should be interpreted as a statistical trend, and as such, Boguraev and Briscoe's results also confirm it. They claim, however, that in adopting a statistical point of view, predictive power is lost.
Our work shows that this conclusion is not appropriate either: the correlation is strong enough to be useful for predicting semantic classification, at least for the argument structures that have been investigated.</Paragraph> </Section> <Section position="3" start_page="398" end_page="399" type="sub_section"> <SectionTitle> 6.3 Detection of Argument Structure in Corpora </SectionTitle> <Paragraph position="0"> Given the manifestation of argument structure in statistical distributions, we view corpora, especially if annotated with currently available tools, as repositories of implicit grammars, which can be exploited in automatic verb-classification tasks. Besides establishing a relationship between syntactic alternations and underlying semantic properties of verbs, our approach extends existing corpus-based learning techniques to the detection and automatic acquisition of argument structure. To date, most work in this area has focused on learning subcategorization from unannotated or syntactically annotated text (e.g., Brent 1993; Sanfilippo and Poznanski 1992; Manning 1993; Collins 1997). Others have tackled the problem of lexical semantic classification, but using only subcategorization frequencies as input data (Lapata and Brew 1999; Schulte im Walde 2000). Specifically, these researchers have not explicitly addressed the definition of features that tap directly into thematic role differences that are not reflected in subcategorization distinctions. On the other hand, when learning of thematic role assignment has been the explicit goal, the text has been semantically annotated (Webster and Marcus 1989), or external semantic resources have been consulted (Aone and McKee 1996; McCarthy 2000). We extend these results by showing that thematic information can be induced from linguistically guided counts in a corpus, without the use of thematic role tagging or external resources such as WordNet.</Paragraph> <Paragraph position="1"> Finally, our results converge with the increasing agreement that corpus-based techniques are fruitful in the automatic construction of computational lexicons, providing machine-readable dictionaries with complementary, reusable resources, such as frequencies of argument structures. Moreover, these techniques produce data that is easily updated, as the information contained in corpora changes all the time, allowing for adaptability to new domains or usage patterns. This dynamic aspect could be exploited if techniques such as the one presented here are developed--techniques that can work on a rough collection of texts and do not require a carefully balanced corpus or time-consuming semantic tagging.</Paragraph> </Section> </Section> </Paper>