<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0401"> <Title>Statistical Measures of the Semi-Productivity of Light Verb Constructions</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Our Proposal </SectionTitle> <Paragraph position="0"> The initial goal in our investigation of semi-productivity is to find a means of determining how well particular light verbs and complements go together. We focus on the &quot;LV a V&quot; constructions because we are interested in the hypothesis that the complement to the LV is a verb, and we think that the properties of this construction may place interesting restrictions on what forms a valid LVC.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Generalizing over Verb Classes </SectionTitle> <Paragraph position="0"> As noted above, there are constraints in an &quot;LV a V&quot; construction on which complements can occur with particular light verbs. Moreover, similar potential complements pattern alike in this regard; that is, semantically similar complements may have the same pattern of co-occurrence across different light verbs. Since the complement is hypothesized to be a verbal element, we look to verb classes to capture the relevant semantic similarity. The lexical semantic classes of Levin (1993) have been used as a standard verb classification within the computational linguistics community. We thus propose using these classes as the semantically similar groups over which to compare the acceptability of potential complements with a given light verb.2 Our approach is related to the idea of substitutability in multiword expressions. Substituting pieces of a multiword expression with semantically similar words from a thesaurus can be used to determine productivity, with a higher degree of substitutability indicating higher productivity (Lin, 1999; McCarthy et al., 2003).3 Instead of using a thesaurus-based measure, Villavicencio (2003) uses substitutability over semantic verb classes to determine potential verb-particle combinations.</Paragraph> <Paragraph position="1"> Footnote 2: We also need to compare generalizability over semantic noun classes to further test the linguistic hypothesis. We initially performed such experiments on noun classes in WordNet, but, due to the difficulty of deciding on an appropriate level of generalization in the hierarchy, we left this as future work.</Paragraph> <Paragraph position="2"> Footnote 3: Note that although Lin characterizes his work as detecting non-compositionality, we agree with Bannard et al. (2003) that it is better thought of as tapping into productivity.</Paragraph> <Paragraph position="3"> Our method differs from these earlier approaches not only in focusing on LVCs, but also in its precise goal. While Villavicencio (2003) uses verb classes to generalize over verbs and then confirms whether an expression is attested, we seek to determine how good an expression is. Specifically, we aim to develop a computational approach not only for characterizing the set of complements that can occur with a given light verb in these LVCs, but also for quantifying their acceptability.</Paragraph> <Paragraph position="4"> In investigating light verbs and their combination with complements from various verb semantic classes, we expect that these LVCs are not fully idiosyncratic, but exhibit systematic behaviour. Most importantly, we hypothesize that they show class-based behaviour, i.e., that the same light verb will show distinct patterns of acceptability with complements across different verb classes. We also explore whether the light verbs themselves show different patterns in terms of how they are used semi-productively in these constructions.</Paragraph>
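To make the class-based generalization concrete, the sketch below pairs each light verb with the member verbs of a single Levin class to produce the candidate "LV a V" strings whose acceptability would then be scored. The class excerpt and the pairing routine are our own illustration, not the paper's code or Levin's full class listing.

```python
# Illustrative only: pair light verbs with one Levin class's members to
# generate candidate "LV a V" strings for later scoring.
LIGHT_VERBS = ["take", "give", "make"]

# Hypothetical excerpt from Levin class 18.1 (Hit verbs); not the full class.
HIT_VERBS = ["hit", "kick", "slap", "smack", "whack"]

def candidate_lvcs(light_verbs, verb_class):
    """Return every 'LV a V' candidate for the given class."""
    return [f"{lv} a {v}" for lv in light_verbs for v in verb_class]

for candidate in candidate_lvcs(LIGHT_VERBS, HIT_VERBS):
    print(candidate)   # e.g., "take a kick", "give a slap", "make a whack"
```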
<Paragraph position="5"> We choose to focus on the light verbs take, give, and make. We choose take and give because they seem similar in their ability to occur in a range of LVCs, and yet they have almost opposite semantics. We hope that the latter will reveal interesting patterns in their occurrence with the different verb classes. On the other hand, make seems very different from both take and give. It seems much less restrictive in its combinations, and it also seems difficult to distinguish in terms of light versus &quot;heavy&quot; uses. We expect it to show different generalization behaviour from the other two light verbs.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Devising an Acceptability Measure </SectionTitle> <Paragraph position="0"> Given the experimental focus, we must devise a method for determining the acceptability of LVCs. One possibility is to use a standard measure for detecting collocations, such as pointwise mutual information (Church et al., 1991). &quot;LV a V&quot; constructions are well-suited to collocational analysis, as the light verb can be seen as the first component of a collocation, and the string &quot;a V&quot; as the second component. Applying this idea to potential LVCs, we calculate pointwise mutual information, I(lv; aV).</Paragraph> <Paragraph position="1"> In addition, we use the linguistic properties of the &quot;LV a V&quot; construction to develop a more informed measure. As noted in Section 2, generally only the indefinite determiner a (or an) is allowed in this type of LVC. We hypothesize then that for a &quot;good&quot; LVC, we should find a much higher mutual information value for &quot;LV a V&quot; than for &quot;LV [det] V&quot;, where [det] is any determiner other than the indefinite. While I(lv; aV) should tell us whether &quot;LV a V&quot; is a good collocation (Church et al., 1991), the difference between the two, I(lv; aV) - I(lv; detV), should tell us whether the collocation is an LVC.</Paragraph> <Paragraph position="2"> To summarize, we assume that: if I(lv; aV) &lt;= 0, then &quot;LV a V&quot; is likely not a collocation; and if I(lv; aV) - I(lv; detV) &lt;= 0, then &quot;LV a V&quot; is likely not a true LVC.</Paragraph> <Paragraph position="3"> In order to capture these two conditions in a single measure, we combine them by using a linear approximation to the two lines given by I(lv; aV) = 0 and I(lv; aV) - I(lv; detV) = 0. The most straightforward line approximating the combined effect of these two conditions is: 2 I(lv; aV) - I(lv; detV) = 0.</Paragraph> <Paragraph position="4"> We hypothesize that this combined measure will correlate better with human ratings of the LVCs than the mutual information of the &quot;LV a V&quot; construction alone. For I(lv; detV), we explore several possible sets of determiners standing in for &quot;det&quot;, including the, this, that, and the possessive determiners.</Paragraph>
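As a minimal sketch of the computation, the code below derives I(lv; aV) and I(lv; detV) from raw frequency counts and combines them as 2 I(lv; aV) - I(lv; detV), following the reconstruction above. All counts are invented for illustration; the paper's actual values come from Google hit counts.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2( P(x,y) / (P(x) P(y)) )."""
    return math.log2((f_xy * n) / (f_x * f_y))

n = 5_600_000_000            # corpus size, estimated from hits for "the"
f_lv = 100_000_000           # hits for the light verb across its tenses
f_aV, f_lv_aV = 500_000, 60_000     # "a stroll" / "take a stroll" (invented)
f_detV, f_lv_detV = 200_000, 500    # "[det] stroll" / "take [det] stroll"

mi = pmi(f_lv_aV, f_lv, f_aV, n)          # I(lv; aV)
mi_det = pmi(f_lv_detV, f_lv, f_detV, n)  # I(lv; detV)
diff = 2 * mi - mi_det                    # combined acceptability measure
print(f"I(lv; aV) = {mi:.2f}, combined = {diff:.2f}")
```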
<Paragraph position="5"> We find, contrary to the linguistic claim, that the is not always rare in &quot;LV a V&quot; constructions, and the measures excluding the perform best on development data.4</Paragraph> <Paragraph position="6"> Footnote 4: Cf. I took the hike that was recommended. This finding supports a statistical corpus-based approach to LVCs, as their usage may be more nuanced than linguistic theory suggests.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Materials and Methods </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experimental Classes </SectionTitle> <Paragraph position="0"> Three Levin classes are used for the development set, and four classes for the test set, as shown in Table 1. [Table 1 caption fragment: &quot;'*' indicates a random subset of verbs in the class.&quot;] Each set of classes covers a range of LVC productivity with the light verbs take, give, and make, from classes in which we felt no LVCs were possible with a given LV, to classes in which many of the verbs listed seemed to form valid LVCs with a given LV.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Corpora </SectionTitle> <Paragraph position="0"> Even the 100M words of the British National Corpus (BNC Reference Guide, 2000) do not give an acceptable level of LVC coverage: a very common LVC such as take a stroll, for instance, is attested only 23 times. To ensure sufficient data to detect less common LVCs, we instead use the Web as our corpus (in particular, the subsection indexed by the Google search engine, http://www.google.com).</Paragraph> <Paragraph position="1"> Using the Web to overcome data sparseness has been attempted before (Keller et al., 2002); however, there are issues: misspellings, typographic errors, and pages in other languages all contribute to noise in the results. Moreover, punctuation is ignored in Google searches, meaning that search results can cross phrase or sentence boundaries. For instance, an exact phrase search for &quot;take a cry&quot; would return a web page containing the text It was too much to take. A cry escaped his lips. When searching for an unattested LVC, these noisy results can begin to dominate. In ongoing work, we are devising automatic clean-up methods to eliminate some of these false positives.</Paragraph> <Paragraph position="2"> On the other hand, it should be pointed out that not all &quot;good&quot; LVCs will appear in our corpus, despite its size. In this view we differ from Villavicencio (2003), who assumes that if a multiword expression is not found in the Google index, then it is not a good construction. As an example, consider The clown took a cavort across the stage. The LVC seems plausible; however, Google returns no results for &quot;took a cavort&quot;. This underlines the need for determining plausible (as opposed to merely attested) LVCs, which class-based generalization has the potential to support.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Extraction </SectionTitle> <Paragraph position="0"> To measure mutual information, we gather several counts for each potential LVC: the frequency of the LVC (e.g., give a cry), the frequency of the light verb (e.g., give), and the frequency of the complement of the LVC (e.g., a cry). To achieve broader coverage, counts of the light verbs and the LVCs are collapsed across three tenses: the base form, the present, and the simple past.</Paragraph>
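The sketch below shows the kind of query expansion this implies: each candidate pair yields exact-phrase searches over the three verb forms crossed with the singular determiners. The tense table and determiner inventory here are our own assumptions for illustration, not the paper's exact search set.

```python
# Assumed verb-form table and singular-determiner inventory (illustrative).
TENSES = {
    "take": ["take", "takes", "took"],
    "give": ["give", "gives", "gave"],
    "make": ["make", "makes", "made"],
}
# Both "a" and "an" are generated; only the form matching the complement's
# onset will return real hits.
SINGULAR_DETS = ["a", "an", "the", "this", "that",
                 "my", "your", "his", "her", "its", "our", "their"]

def lvc_queries(lv, complement):
    """Exact-phrase searches for one light verb + complement pair."""
    return [f'"{form} {det} {complement}"'
            for form in TENSES[lv] for det in SINGULAR_DETS]

def complement_queries(complement):
    """Searches for the complement alone, i.e., "[det] V"."""
    return [f'"{det} {complement}"' for det in SINGULAR_DETS]

print(lvc_queries("give", "cry")[:3])  # ['"give a cry"', '"give an cry"', '"give the cry"']
```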
<Paragraph position="1"> Since we are interested in the differences across determiners, we search for both the LVC (&quot;give [det] cry&quot;) and the complement alone (&quot;[det] cry&quot;) using all singular determiners. Thus, for each LVC, we require a number of LVC searches, as exemplified in Table 2, along with analogous searches for &quot;[det] V&quot;.</Paragraph> <Paragraph position="2"> All searches were performed using an exact string search in Google, during a 24-hour period in March, 2004. The number of results returned is used as the frequency count. Note that this is an underestimate, since an LVC may occur more than once in a single web page; however, examining each document to count the actual occurrences is infeasible, given the number of possible results. The size of the corpus (also needed in calculating our measures) is estimated at 5.6 billion, the number of hits returned in a search for &quot;the&quot;. This is also surely an underestimate, but it is consistent with our other frequency counts.</Paragraph> <Paragraph position="3"> NSP is used to calculate pointwise mutual information over the counts (Banerjee and Pedersen, 2003).</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"> In these initial experiments, we compare human ratings of the target LVCs to several mutual information measures over our corpus counts, using the Spearman rank correlation. We have two goals: to see whether these LVCs show differing behaviour according to the light verb and/or the verb class of the complement, and to determine whether we can indeed predict acceptability from corpus statistics.</Paragraph> <Paragraph position="1"> We first describe the human ratings, then the correlation results on our development and test data.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Human Ratings </SectionTitle> <Paragraph position="0"> We use pilot results in which two native speakers of English rated each &quot;LV a V&quot; combination for acceptability. For the development classes, we used integer ratings of 1 (unacceptable) to 5 (completely natural), also allowing &quot;in-between&quot; ratings such as 2.5. For the test classes, we set the top rating at 4, since we found that ratings up to 5 covered a larger range than seemed natural. The test ratings yielded linearly weighted Kappa values of .72, .39, and .44, for take, give, and make, respectively, and .53 overall.5</Paragraph> <Paragraph position="1"> Footnote 5: Agreement on the development set was much lower (linearly weighted Kappa values of .37, .23, and .56, for take, give, and make, respectively, and .38 overall), due to differences in interpretation of the ratings. Discussion of these issues by the raters led to more consistency in the test data ratings.</Paragraph> <Paragraph position="2"> To determine a consensus rating, the human raters first discussed disagreements of more than one rating point. In the test data, this led to 6% of the ratings being changed. (Note that this is 6% of ratings, not 6% of verbs; fewer verbs were affected, since for some verbs both raters changed their rating after discussion.) We then simply averaged each pair of ratings to yield a single consensus rating for each item.</Paragraph> <Paragraph position="3"> In order to see differences in human ratings across the light verbs and the semantic classes of their complements, we put the (consensus) human ratings in bins of low (ratings below 2), medium (ratings at least 2 but below 3), and high (ratings 3 or above). (Even a score of 2 meant that an LVC was &quot;ok&quot;.) Table 3 shows the distribution of medium and high scores for each of the light verbs and test classes. [Table 3 caption fragment: &quot;... for each LV and class. N is the number of test verbs.&quot;]</Paragraph>
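For reference, linearly weighted kappa values like those reported above can be computed with scikit-learn; the two raters' scores below are invented. Passing the full ordered rating scale as labels makes the linear weights reflect the distance between rating categories, including the half-point steps.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

scale = np.arange(1.0, 4.5, 0.5)           # the 1-4 scale in half-point steps
rater1 = [1.0, 2.0, 3.5, 4.0, 1.5, 3.0]    # invented ratings
rater2 = [1.0, 2.5, 3.0, 4.0, 1.0, 3.0]

kappa = cohen_kappa_score(rater1, rater2, labels=scale, weights="linear")
print(f"linearly weighted kappa = {kappa:.2f}")
```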
<Paragraph position="4"> We can see that some classes generally allow more LVCs across the light verbs (e.g., 18.1,2) than others (e.g., 43.2). Furthermore, the light verbs show very different patterns of acceptability for different classes; e.g., give is fairly good with 43.2, while take is very bad, and the pattern is reversed for 51.4.2. In general, give allows more LVCs on the test classes than do the other two light verbs.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Correlations with Statistical Measures </SectionTitle> <Paragraph position="0"> Our next step is to see whether the ratings, and the patterns across light verbs and classes, are reflected in the statistical measures over corpus data. Because our human ratings are not normally distributed (generally having a high proportion of values less than 2), we use the Spearman rank correlation coefficient to compare the consensus ratings to the mutual information measures.6 As described in Section 3.2, we use pointwise mutual information over the &quot;LV a V&quot; string, as well as measures we developed that incorporate the linguistic observation that these LVCs typically do not occur with definite determiners. On our development set, we tested several of these measures and found that the following had the best correlations with human ratings: MI = I(lv; aV) and DiffAll = 2 I(lv; aV) - I(lv; detV), where I(lv; detV) is the mutual information over strings &quot;LV [det] V&quot;, and det is any determiner other than a, an, or the. Note that DiffAll is the most general of our combined measures; however, some verbs are not detected with other determiners, and thus DiffAll may apply to a smaller number of items than MI.</Paragraph> <Paragraph position="1"> Footnote 6: Experiments on the development set to determine a threshold on the different measures, to classify LVCs as good or not, showed promise in their coarse match with human judgments. However, we set this work aside for now, since the correlation coefficients are more informative regarding the fine-grained match of the measures to human ratings, which cover a fairly wide range of acceptability.</Paragraph> <Paragraph position="2"> We focus on the analysis of these two measures on test data, but the general patterns are the same on the development set. Table 4 shows the correlation results on our unseen test LVCs. [Table 4 caption fragment: &quot;... information measures and the consensus human ratings, on unseen test data.&quot;] We get reasonably good correlations with the human ratings across a number of the light verbs and classes, indicating that these measures may be helpful in determining which light verb plus complement combinations form valid LVCs. In what follows, we examine more detailed patterns, to better analyze the data.</Paragraph>
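A minimal sketch of the rank-correlation computation behind Table 4, using SciPy's spearmanr; the consensus ratings and measure values below are invented for illustration.

```python
from scipy.stats import spearmanr

ratings = [1.0, 1.5, 2.5, 3.0, 3.5, 4.0]      # invented consensus ratings
diff_all = [-2.1, 0.3, 1.8, 2.0, 3.5, 4.2]    # invented DiffAll scores

rho, p = spearmanr(ratings, diff_all)
print(f"Spearman correlation = {rho:.2f} (p = {p:.3f})")
```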
<Paragraph position="3"> First, comparing the test correlations to Table 3, we find that the classes with a low number of &quot;good&quot; LVCs have poor correlations. When we examine the correlation graphs, we see that, in general, there is a good correlation between the ratings greater than 1 and the corresponding measure, but when the rating is 1, there is often a wide range of values for the corpus-based measure. One cause could be noise in the data, as mentioned earlier; that is, for bad LVCs, we are picking up too many &quot;false hits&quot;, due to the limitations of using Google searches on the web. To confirm this, we examine one development class (10.4.1, the Wipe manner verbs), which was expected to be bad with take. We find a large number of hits for &quot;take a V&quot; that are not good LVCs, such as &quot;take a strip [of tape/of paper]&quot; and &quot;take a pluck[-and-play approach]&quot;. On the other hand, some examples with unexpectedly high corpus measures are LVCs the human raters were simply not aware of (&quot;take a skim through the manual&quot;), which underscores the difficulty of human rating of a semi-productive construction.</Paragraph> <Paragraph position="4"> Second, we note that we get very good correlations with take, somewhat less good correlations with give, and generally poor correlations with make. We had predicted that take and give would behave similarly (and the difference between take and give is less pronounced in the development data). We think one reason give has poorer correlations is that it was harder to rate (it had the highest proportion of disagreements), and so the human ratings may not be as consistent as for take. Also, for a class like 30.3, which we expected to be good with give (e.g., give a look, give a glance), we found that the LVCs were mostly good only in the dative form (e.g., give her a look, give it a glance). Since we only looked for exact matches to &quot;LV a V&quot;, we did not detect this kind of construction.</Paragraph> <Paragraph position="5"> We had predicted that make would behave differently from take and give, and indeed, except in one case, the correlations for make are poorer on the individual classes. Interestingly, the overall correlation attains a much better value using the mutual information of &quot;LV a V&quot; alone (i.e., the MI measure). We think that the pattern of correlations with make may arise because in many cases it is not necessarily a &quot;true light verb&quot; construction, but rather a &quot;vague action verb&quot; (see Section 2). If so, its behaviour across the complements may be somewhat more arbitrary, combining different uses.</Paragraph> <Paragraph position="6"> Finally, we compare the combined measure DiffAll to the mutual information, MI, alone. We hypothesized that while the latter should indicate a collocation, the combined measure should help to focus on LVCs in particular, because of their linguistic property of occurring primarily with an indefinite determiner. On the individual classes, when considering correlations that are statistically significant or marginally so (i.e., at a confidence level of 90%), the DiffAll measure overall has somewhat stronger correlations than MI. Over all complement verbs together, DiffAll is roughly the same as MI for take, somewhat better for give, and worse for make.7 Better performance over the individual classes indicates that when applying the measures, at least to take and give, it is helpful to separate the data according to semantic verb class. For make, the appropriate approach is not as clear, since the results on the individual classes are so skewed.</Paragraph> <Paragraph position="7"> Footnote 7: The development data are similar to the test data in favouring DiffAll over MI across the individual classes. Over all development verbs together, DiffAll is somewhat better than MI for take, roughly the same for give, and somewhat worse for make.</Paragraph>
<Paragraph position="8"> In general, the results confirm our hypothesis that semantic verb classes are highly relevant to measuring the acceptability of LVCs of this type. The results also indicate the need to look in more detail at the properties of different light verbs.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> Other computational research on LVCs differs from ours in two key respects. First, previous work has looked at any nominalization as the complement of a potential light verb (what is termed a &quot;support verb&quot;) (Fontenelle, 1993; Grefenstette and Teufel, 1995; Dras and Johnson, 1996). Our work differs in focusing on verbal nouns that form the complement of a particular type of LVC, allowing us to explore the role of class information in restricting the complements of these constructions. Second, this earlier work has viewed all verbs as possible light verbs, while we look at only the class of potential light verbs identified by linguistic theory.</Paragraph> <Paragraph position="1"> The difference in focus on these two aspects of the problem leads to the basic differences in approach: while they attempt to find probable light verbs for nominalization complements, we try to find possible (verbal) noun complements for given light verbs. Our work differs both practically, in the type of measure used, and conceptually, in the formulation of the problem. For example, Grefenstette and Teufel (1995) used some linguistic properties to weed out potential light verbs from lists sorted by raw frequency, while Dras and Johnson (1996) used the frequency of the verb weighted by a weak predictor of its prior probability as a light verb. We instead use a standard collocation detection measure (mutual information), the terms of which we modify to capture linguistic properties of the construction.</Paragraph> <Paragraph position="2"> More fundamentally, our proposal differs in its emphasis on possible class-based generalizations in LVCs that have heretofore been unexplored. It would be interesting to apply this idea to the broader classes of nominalizations investigated in earlier work. Moreover, our approach could draw on ideas from the earlier proposals to detect light verbs automatically, since the precise set of LVs differs cross-linguistically, and LV status may indeed be a continuum rather than a discrete distinction.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Conclusions and Future Work </SectionTitle> <Paragraph position="0"> Our results demonstrate the benefit of treating LVCs as more than just a simple collocation. We exploit linguistic knowledge particular to the &quot;LV a V&quot; construction to devise an acceptability measure that correlates reasonably well with human judgments.
By comparing the mutual information with indefinite versus definite determiners, we use syntactic patterns to tap into the distinctive underlying properties of the construction.</Paragraph> <Paragraph position="1"> Furthermore, we hypothesized that, because the complement in these constructions is a verb, we would see systematic behaviour across the light verbs in terms of their ability to combine with complements from different verb classes. Our human ratings indeed showed class-based tendencies for the light verbs. Moreover, our acceptability measure showed higher correlations when the verbs were divided by class. This indicates that, within a verb class, there is greater consistency between the corpus statistics and the ability to combine with a light verb. Thus, the semantic classes provide a useful way to increase the performance of the acceptability measure.</Paragraph> <Paragraph position="2"> The correlations are far from perfect, however. In addition to noise in the data, one problem may be that these classes are too coarse-grained. Other possible verb (and noun) classes need to be explored as the basis for generalizing the complements of these constructions. However, we must also look to the measures themselves to improve our techniques. Several linguistic properties distinguish these constructions, but our measures drew on only one. In ongoing work, we are exploring methods for incorporating other linguistic behaviours into a measure for these constructions, as well as for LVCs more generally.</Paragraph> <Paragraph position="3"> We are widening this investigation in other directions as well. Our results reveal interesting differences among the light verbs, indicating that the set of light verbs is itself heterogeneous. More research is needed to determine the properties of a broader range of light verbs, and how they influence the valid combinations these verbs form with semantic classes.</Paragraph> <Paragraph position="4"> Finally, we plan to collect more extensive rating data, but we are mindful of the difficulty of judging these constructions. Gathering solid human ratings is a challenge in this line of investigation, but this only underscores the importance of devising corpus-based acceptability measures in order to better support the development of accurate computational lexicons.</Paragraph> </Section> </Paper>