<?xml version="1.0" standalone="yes"?> <Paper uid="J02-3004"> <Title>(c) 2002 Association for Computational Linguistics The Disambiguation of Nominalizations</Title> <Section position="5" start_page="367" end_page="369" type="metho"> <SectionTitle> 4. The Disambiguation Algorithm </SectionTitle>
<Paragraph position="0"> The disambiguation algorithm for nominalizations is summarized in Figure 1. The algorithm uses verb-argument tuples to infer the relation holding between the modifier and its nominalized head. When the co-occurrence frequency of the verb-argument relations is zero, verb-argument tuples are smoothed using one of the methods described in Section 3.</Paragraph>
<Paragraph position="1"> Once frequencies (either actual or reconstructed through smoothing) for verb-argument relations have been obtained, the RA score determines the relation between the modifier and its nominalized head (see Section 2). The sign of the RA score indicates which relation, subject or object, is more likely: a positive RA score indicates an object relation, whereas a negative score indicates a subject relation. Depending on the task and the data at hand, we can require that an object or subject analysis be preferred only if RA exceeds a certain threshold j (see steps 7 and 8 in Figure 1). We can also impose a threshold k on the type of verb-argument tuples we smooth. If, for instance, we know that the parser's output is noisy, then we might choose to smooth not only unseen verb-argument pairs but also pairs with nonzero corpus frequencies (e.g., f(verb, rel, noun) >= 1; see steps 3 and 4 in Figure 1).</Paragraph>
<Paragraph position="4"> Consider, for example, the compound student administration: its corresponding verb-noun configuration (e.g., administer student) is not attested in the BNC. This is a case in which we need smoothed estimates for both f(v, obj, n) and f(v, subj, n). The re-created frequencies using the class-based smoothing method described in Section 3.2 are 5.06 and 2.59, respectively, yielding an RA score of .96 (see Table 5), which means that it is more likely that student is the object of administration. Consider now the compound unit establishment: here, we have very little evidence in the corpus with respect to the verb-subject relation (see Table 5, where f(establish, subj, unit)=1).</Paragraph>
<Paragraph position="7"> Assuming we have set the threshold k to 2 (see steps 4 and 5 in Figure 1), we need only re-create the frequency for the subject relation (e.g., 14.99 using class-based smoothing). The resulting RA score is again positive (see Table 5), which indicates that there is a greater probability for unit to be the object of establishment than for it to be the subject. Finally, consider the compound government promotion: counts for both subject and object relations are found in the BNC (see Table 5), in which case no smoothing is involved; we need only calculate the RA score (see step 6 in Figure 1), which is negative, indicating that government is more likely to be the subject of promotion than its object.</Paragraph> </Section>
<Section position="6" start_page="369" end_page="379" type="metho"> <SectionTitle> 5. Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="369" end_page="370" type="sub_section"> <SectionTitle> 5.1 Methodology </SectionTitle>
<Paragraph position="0"> The algorithm described in the previous section and the smoothing variants were evaluated on the task of disambiguating nominalizations.
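For concreteness, the sketch below traces the control flow of Figure 1 in Python. The frequency lookup and the smoothing routine are hypothetical placeholders for the parser counts and the Section 3 methods, and the log-ratio form of RA is inferred from the worked example above (log2(5.06/2.59) is approximately .96); treat it as an illustration of the algorithm's steps rather than the exact implementation.

```python
import math
from typing import Callable

def ra_score(f_obj: float, f_subj: float) -> float:
    """Relative argument score: positive favors an object reading, negative a subject reading.

    The log-ratio form reproduces the worked example in the text
    (log2(5.06 / 2.59) ~ .96); the exact definition is given in Section 2.
    """
    return math.log2(f_obj / f_subj)

def disambiguate(verb: str, noun: str,
                 get_frequency: Callable[[str, str, str], float],
                 smooth: Callable[[str, str, str], float],
                 j: float = 0.0, k: float = 1.0) -> str:
    """Schematic rendering of the interpretation algorithm summarized in Figure 1.

    j is the decision threshold on RA; k is the count threshold below which a
    verb-argument tuple is re-estimated by smoothing (k = 1 smooths only unseen
    tuples, k = 2 also smooths hapax counts, and so on).
    """
    freqs = {}
    for rel in ("obj", "subj"):
        f = get_frequency(verb, rel, noun)      # counts from the parsed corpus
        if f < k:                               # steps 3-5: smooth sparse tuples
            f = smooth(verb, rel, noun)
        freqs[rel] = f
    ra = ra_score(freqs["obj"], freqs["subj"])  # step 6: compute the RA score
    if ra > j:                                  # steps 7-8: thresholded decision
        return "object"
    if ra < -j:
        return "subject"
    return "undecided"

# Toy run with the student administration frequencies quoted above
# (both re-created by class-based smoothing, hence the zero raw counts).
raw = {("administer", "obj", "student"): 0.0, ("administer", "subj", "student"): 0.0}
smoothed = {("administer", "obj", "student"): 5.06, ("administer", "subj", "student"): 2.59}
print(disambiguate("administer", "student",
                   lambda v, r, n: raw.get((v, r, n), 0.0),
                   lambda v, r, n: smoothed[(v, r, n)]))   # prints: object
```

With j = 0 and k = 1 the procedure reduces to smoothing only unseen tuples and choosing whichever relation the sign of RA favors.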
As detailed above, the Jensen-Shannon divergence and confusion probability measures are parameterized.</Paragraph>
<Paragraph position="1"> This means that we need to establish empirically the best parameter values for the size of the vocabulary (i.e., the number of verbs used to find the nearest neighbors) and, for the Jensen-Shannon divergence, the effect of the b parameter. Recall from Section 2.2.2 that we obtained 796 nominalizations from the BNC. From these, 596 were used as training data for finding the optimal parameters for the two variants of distance-weighted averaging. The 596 nominalizations were also used to find the optimal thresholds for the interpretation algorithm. The remaining 200 nominalizations were retained as test data and also to evaluate whether human judges can reliably disambiguate the argument relation between the nominalized head and its modifier (see Experiment 1).</Paragraph>
<Paragraph position="2"> In Experiment 2 we investigate how the different smoothing techniques detailed in Section 3 influence the disambiguation task. As far as class-based smoothing is concerned, we experiment with two concept hierarchies, Roget's thesaurus and WordNet.</Paragraph>
<Paragraph position="3"> Although no parameter tuning is necessary for class-based and back-off smoothing, we maintain the train/test data distinction also for these methods to facilitate comparisons with distance-weighted averaging.</Paragraph>
<Paragraph position="4"> We also examine whether knowledge of the semantics of the suffix of the nominalized head can improve performance. We run two versions of the algorithm presented in Section 4: in one version the algorithm assumes no prior knowledge about the semantics of the nominalization suffix (see Figure 1); in the other version the algorithm computes the RA score only for compounds with nominalization suffixes other than -er, -or, -ant, or -ee. For compounds with the suffixes -er, -or, and -ant (e.g., datum holder, car collector, water disinfectant), the algorithm defaults to an object interpretation, and it defaults to a subject analysis for compounds with the suffix -ee (e.g., university employee). Compounds with heads ending in these four suffixes represented 13.6% of the compounds contained in the train set and 10.8% of the compounds in the test set.</Paragraph>
<Paragraph position="5"> In Experiment 3 we explore how the combination of the different smoothing methods influences disambiguation performance; we also consider context as an additional predictor of the argument relation of a deverbal head and its modifier and combine these distinct information sources using Ripper (Cohen 1996), a machine learning system that induces sets of rules from preclassified examples.</Paragraph>
<Paragraph position="6"> In what follows we briefly describe our study on assessing how well humans agree on disambiguating nominalizations. This study establishes an upper bound for the task against which our automatic methods will be compared. Sections 5.3 and 5.4 present our results on the disambiguation task.</Paragraph> </Section>
<Section position="2" start_page="370" end_page="371" type="sub_section"> <SectionTitle> 5.2 Experiment 1: Agreement </SectionTitle>
<Paragraph position="0"> Two graduate students in linguistics decided whether modifiers were the subject or object of a given nominalized head. The judges were given a page of guidelines but no prior training.
The nominalizations were disambiguated in context: the judges were given the corpus sentence in which the nominalization occurred together with the previous and following sentence. We measured the judges' agreement using the kappa coefficient (Siegel and Castellan 1988), which is the ratio of the proportion of times P(A) that k raters agree (corrected by chance agreement P(E)) to the maximum proportion of times the raters would agree (corrected for chance agreement): K = (P(A) - P(E)) / (1 - P(E)).</Paragraph>
<Paragraph position="2"> If there is complete agreement among the raters, then K = 1, whereas if there is no agreement among the raters (other than the agreement that would be expected to occur by chance), then K = 0.</Paragraph>
<Paragraph position="3"> The judges' agreement on the disambiguation task was K = .78 (N = 200, k = 2). This translates into a percentage agreement of 89.7%. Although the kappa coefficient has a number of advantages over percentage agreement (e.g., it takes into account the expected chance interrater agreement; see Carletta (1996) for details), we also report percentage agreement as it allows us to compare straightforwardly the human performance and the automatic methods described below, whose performance will also be reported in terms of percentage agreement. Furthermore, percentage agreement establishes an intuitive upper bound for the task (i.e., 89.7%), allowing us to interpret how well our empirical models are doing in relation to humans.</Paragraph>
<Paragraph position="4"> Finally, note that the level of agreement was good, given that the judges were provided with minimal instructions and no prior training. Even though context was provided to aid the disambiguation task, however, the judges were not in complete agreement. This points to the intrinsic difficulty of the task at hand. Argument relations and consequently selectional restrictions are influenced by several pragmatic factors that may not be readily inferred from the immediate context (see Section 6 for discussion).</Paragraph> </Section>
<Section position="3" start_page="371" end_page="374" type="sub_section"> <SectionTitle> 5.3 Experiment 2: Comparison of Smoothing Variants </SectionTitle>
<Paragraph position="0"> Before reporting the results of the disambiguation task, we describe our initial experiments on finding the optimal parameter settings for the two distance-weighted averaging smoothing methods.</Paragraph>
[Figure 2: Disambiguation accuracy as the number of similar neighbors (i.e., number of verbs over which the similarity function is calculated) is varied for P C and J.]
<Paragraph position="1"> Figure 2 shows how performance on the disambiguation task varies with respect to the number and frequency of verbs over which the similarity function is calculated. The y-axis in Figure 2 shows how performance on the training set varies (for both P C and J) when verb-argument pairs are selected for the 1,000 most frequent verbs in the corpus, the 2,000 most frequent verbs in the corpus, etc. (x-axis). The best performance for both similarity functions is achieved with the 2,000 most frequent verbs. Furthermore, J and P C yield comparable performances (68.0% and 68.3%, respectively, under that condition). Another important observation is that performance deteriorates less severely for P C than for J as the number of verbs increases: when all verbs for which verb-argument tuples are extracted from the BNC are used, the accuracy for P C is 66.9%, whereas the accuracy for J is 62.8%.
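To make concrete what the vocabulary-size and b parameters control, here is a minimal sketch of distance-weighted averaging restricted to the n most frequent verbs. The exp(-b * J) weighting, the normalization, and all helper names are assumptions standing in for the exact formulation of Section 3 (not reproduced in this section); the sketch only illustrates how the two parameters enter the computation.

```python
import math

def jensen_shannon(p, q):
    """J divergence between two argument distributions (dicts noun -> probability)."""
    def kl(a, m):
        return sum(a[x] * math.log2(a[x] / m[x]) for x in a if a[x] > 0)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def argument_distribution(verb, rel, counts):
    """P(noun | verb, rel) estimated from the extracted verb-argument tuples."""
    row = {n: c for (v, r, n), c in counts.items() if v == verb and r == rel}
    total = sum(row.values())
    return {n: c / total for n, c in row.items()} if total else {}

def recreate_frequency(verb, rel, noun, counts, frequent_verbs, b=5.0):
    """Re-create f(verb, rel, noun) from the weighted counts of distributionally
    similar verbs, where candidates are limited to the n most frequent verbs."""
    target = argument_distribution(verb, rel, counts)
    weighted_sum, normalizer = 0.0, 0.0
    for neighbor in frequent_verbs:
        if neighbor == verb:
            continue
        # Larger b concentrates the estimate on the closest neighbors.
        weight = math.exp(-b * jensen_shannon(target, argument_distribution(neighbor, rel, counts)))
        weighted_sum += weight * counts.get((neighbor, rel, noun), 0.0)
        normalizer += weight
    return weighted_sum / normalizer if normalizer else 0.0

# Toy usage with made-up counts; real counts come from the parsed BNC tuples.
counts = {("give", "obj", "drug"): 3.0, ("give", "obj", "student"): 2.0,
          ("administer", "obj", "drug"): 4.0, ("administer", "obj", "test"): 2.0}
print(recreate_frequency("administer", "obj", "student", counts, ["give"]))
```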
These results are perhaps unsurprising: verb-argument pairs with low-frequency verbs introduce noise due to the errors inherent in the partial parser. Table 6 shows the 10 closest words to the verb accept according to P C as the number of verbs is varied: the quality of the closest neighbors deteriorates with the inclusion of less frequent verbs. Finally, we analyzed the role of the parameter b. Recall that b appears in the weight function for the Jensen-Shannon divergence and controls the influence of the most similar words: the contribution of the closest neighbors increases with a high value for b. Figure 3 shows how the value of b affects performance on the disambiguation task when the similarity function is computed for the 1,000 and 2,000 most frequent verbs in the corpus. It is clear that performance is low with high or very low b values (e.g., b in {2, 9}). We chose to set the parameter b to five.
[Table 6: the 10 closest neighbors of accept according to P C as the number of verbs is varied; the flattened table entries are omitted here.]
[Figure 3: Disambiguation accuracy for J as b is varied for the 1,000 and 2,000 most frequent verbs in the BNC.]</Paragraph>
<Paragraph position="2"> Table 7 shows how the five smoothing methods (back-off (B), class-based smoothing using WordNet (Wn) or Roget's thesaurus (Ro), and distance-weighted averaging using the confusion probability (P C ) and the Jensen-Shannon divergence (J)) influence performance in predicting the relation between a modifier and its nominalized head. For the distance-weighted averaging methods we report the results obtained with the optimal parameter settings (b = 5; 2,000 most frequent verbs). The results in Table 7 were obtained without taking the semantics of the nominalization suffix (-er, -or, -ant, -ee) into account (see Section 5.1).</Paragraph>
<Paragraph position="3"> Let us concentrate on the training set first. The back-off method is outperformed by all other methods, although its performance is comparable to that of class-based smoothing using Roget's thesaurus (63.1% and 65.1%, respectively). Distance-weighted averaging methods outperform concept-based methods, although not considerably (accuracy on the training set was 68.3% for P C and 68.0% for class-based smoothing using WordNet). Furthermore, the particular concept hierarchy used for class-based smoothing seems to have an effect on disambiguation performance: an increase of approximately 3.0% is obtained by using WordNet instead of Roget's thesaurus. One explanation might be that Roget's thesaurus is too coarse-grained a taxonomy for the task at hand. We used the chi-square statistic to examine whether the observed performance is better than the simple default strategy of always choosing an object relation, which yields an accuracy of 59.0% in the training data (see D in Table 7). The proportion of nominalizations classified correctly was significantly greater than 59.0% (p <.01) for all methods but back-off (B) and Roget (Ro).</Paragraph>
<Paragraph position="4"> Similar results are observed on the test set. Again P C outperforms all other methods, achieving an accuracy of 75.8% (see Table 7). The proportion of nominalizations classified correctly by P C is significantly greater than 61.5% (chi-square = 9.37, p <.01), which is the percentage of object relations in the test set.
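As an illustration of this kind of baseline comparison, the snippet below runs a chi-square goodness-of-fit test of a method's correct/incorrect split against the split expected under the majority (object) baseline, using SciPy. The exact test formulation used in the paper is not spelled out in this section, so the reported statistics will not necessarily be reproduced, and the counts plugged in are rounded from the percentages quoted above.

```python
from scipy.stats import chisquare

def beats_baseline(n_correct, n_total, baseline_accuracy, alpha=0.01):
    """Goodness-of-fit of a method's correct/incorrect split against the
    split expected if it only matched the majority-class baseline."""
    observed = [n_correct, n_total - n_correct]
    expected = [baseline_accuracy * n_total, (1.0 - baseline_accuracy) * n_total]
    statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
    return statistic, p_value, p_value < alpha

# Hypothetical counts: 68.3% of the 596 training nominalizations classified
# correctly, compared against the 59.0% always-object baseline quoted above.
if __name__ == "__main__":
    stat, p, significant = beats_baseline(round(0.683 * 596), 596, 0.59)
    print(f"chi-square = {stat:.2f}, p = {p:.4g}, significant at .01: {significant}")
```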
The second-best method is class-based smoothing using WordNet (see Table 7). WordNet's performance is also significantly better (chi-square = 5.64, p <.05) than the baseline. The back-off method, class-based smoothing using Roget's thesaurus, and J yield comparable results (see Table 7). Table 8 shows how each method performs when knowledge about the semantics of the nominalization suffix is taken into account. Recall that compounds with agentive and passive suffixes (i.e., -er, -or, -ant, and -ee) represent 13.6% of the training data and 10.8% of the test data. A general observation is that knowledge of the semantics of the nominalization suffix does not dramatically influence accuracy. Performance on the test data increases 1.5% for Wn, 1.0% for Ro, and 0.5% for distance-weighted averaging (see Table 8). We observe no increase in performance for back-off smoothing (see Tables 7 and 8). These results suggest that the nominalization suffixes do not contribute much additional information to the interpretation task, as their meaning can be successfully retrieved from the corpus.</Paragraph>
<Paragraph position="7"> An interesting question is the extent to which any of the different methods agree in their assignments of subject and object relations. We investigated this by calculating the methods' agreement on the training set using the kappa coefficient. We calculated the kappa coefficient for all pairwise combinations of the five smoothing variants.</Paragraph>
<Paragraph position="8"> The results are reported in Table 9. The highest agreement is observed for P C and class-based smoothing using the WordNet taxonomy (K = .75). Agreement between J and P C as well as agreement between Wn and Ro is rather low (K = .53 and K = .46, respectively). Note that generally low agreement is observed when B is paired with J, P C , Wn, or Ro. This is not entirely unexpected, given the assumptions underlying the different smoothing techniques. Both class-based and distance-weighted averaging methods recreate the frequency of unseen word combinations by relying on corpus evidence for words that are distributionally similar to the words of interest. In distance-weighted averaging smoothing, word similarity is estimated from lexical co-occurrence information, whereas in taxonomic class-based smoothing, similarity emerges from the hierarchical organization of conceptual information. Back-off smoothing, however, incorporates no notion of similarity: unseen sequences are estimated not from similar conditional distributions but from lower-level ones. This also relates to the fact that B's performance is lower than that of Wn and P C (see Table 7), which suggests that smoothing methods that incorporate linguistic hypotheses (i.e., the notion of similarity) perform better than methods relying simply on co-occurrence distributions. To summarize, the agreement values in Table 9 suggest that methods inducing similarity relationships from corpus co-occurrence statistics are not necessarily incompatible with methods that quantify similarity using manually crafted taxonomies and that different smoothing techniques may be appropriate for different tasks.</Paragraph>
<Paragraph position="11"> Table 10 shows how the different methods compare for the task of predicting the individual argument relations for the training and test sets. A general observation is that all methods are fairly good at predicting object relations.
Predicting subject relations is considerably harder: no method exceeds an accuracy of 54.9% (see Table 10). One explanation for this is that selectional constraints imposed on subjects can be more easily overridden by pragmatic and contextual factors than those imposed on objects. Furthermore, selectional constraints on subjects are normally weaker than on objects. J is particularly good at predicting object relations, whereas Wn, Ro, and P C are comparatively better at predicting subject relations.</Paragraph> <Paragraph position="13"/> </Section>
<Section position="4" start_page="374" end_page="379" type="sub_section"> <SectionTitle> 5.4 Experiment 3: Using Ripper to Disambiguate Nominalizations </SectionTitle>
<Paragraph position="0"> An obvious question is whether a better performance can be achieved by combining the five smoothing variants, given that they seem to provide complementary information for predicting argument relations. For example, Wn, Ro, and P C are relatively good for the prediction of subject relations, whereas J is best for the prediction of object relations (see Table 10). Furthermore, note that the probabilistic model introduced in Section 2 and the algorithm based on it (see Section 4) ignore contextual information that can provide important cues for disambiguating nominalizations. Consider the nominalization government promotion in (23a), which was assigned an object (instead of a subject) interpretation by all smoothing variants except Wn. Contextual information could help assign the correct interpretation in cases in which the head of the compound is followed by prepositions such as of (see (23a)) or into (see (23b)). (23) a. It was not felt necessary to take account of government promotion of unionism.</Paragraph>
<Paragraph position="1"> b. But politicians are calling for the Republic's Government to start a Court inquiry into Ross' alleged links with firms in Eire.</Paragraph>
<Paragraph position="2"> In the following we first examine whether combination of the five smoothing variants improves performance at predicting the argument relations for nominalizations (see Section 5.4.1). We then proceed to study the influence of context on the interpretation task; we explore the contribution of context alone (see Section 5.4.2) and in combination with the different smoothing variants (see Section 5.4.3). The different information sources are combined using Ripper (Cohen 1996), a system that induces classification rules from a set of preclassified examples. Ripper takes as input the classes to be learned (in our case the classification is binary, i.e., subject or object), the names and possible values of a set of features, and training data specifying the class and feature values for each training example. In our experiments the features are the smoothing variants and the tokens surrounding the nominalizations in question. The feature vector in (24a) represents the individual predictions of B, Wn, Ro, J, and P C for the interpretation of government promotion (see (23a)). We encode the context surrounding nominalizations using two distinct representations: (a) parts of speech and (b) lemmas. In both cases we encode the position of the tokens with respect to the nominalization in question. The feature vector in (24b) consists of the nominalization court inquiry (see (23b)), represented by its parts of speech (nn1 and nn1, respectively) and a context of five words to its right and five words to its left, also reduced to their parts of speech.
In (24c) the same tokens are represented by their lemmas.</Paragraph>
<Paragraph position="3"> (24) a. [obj, subj, obj, obj, obj] b. [pos, nn0, to0, vvi, aj0, nn1, nn1, prp, pos, aj0, nn2, prp] c. ['s government to start a court inquiry into Ross 's alleged link] Ripper is trained on vectors of values like the ones presented in (24) and outputs a classification model for classifying future examples. The model is learned using greedy search guided by an information gain metric and is expressed as an ordered set of if-then rules. For our experiments Ripper was trained on the 596 nominalizations on which the smoothing methods were compared and tested on the 200 unseen nominalizations for which the interjudge agreement was previously calculated (see Section 5.2).</Paragraph>
<Paragraph position="4"> Table 11 shows Ripper's performance when different combinations of smoothing variants (i.e., features) are used without taking context into account. All results in Table 11 were obtained using the version of the interpretation algorithm that takes suffix semantics into account (see Section 5.3). (An anonymous reviewer pointed out that suffix information could be alternatively exploited by including the ending suffix of the nominalization head as an additional feature for the classification task. The latter approach yields comparable performance to our original idea of defaulting to the argument structure denoted by the nominalization suffix. When B, J, P C , Ro, and Wn are used as features together with nominalization suffixes (-age, -ion, -ment, etc.), Ripper's performance is 79.9% +- 1.65% on the training data and 80.3% +- 2.95% on the test data.) As shown in Table 11, the combination of all five smoothing variants achieves a performance of 80.4%.</Paragraph>
<Paragraph position="5"> Table 11 further reports the accuracy achieved when removing a single feature. Evaluation on subsets of features allows us to explore the contribution of individual features to the classification task by comparing the subsets to the full feature set. We see that removal of Ro has no effect on the results, whereas removal of J produces a 5.7% performance decrease. Removing Wn or P C yields the same decrease in performance (i.e., 0.5%). This is not surprising, since P C and Wn tend to agree in their assignments of subject and object relations (see the methods' agreement in Table 9), and therefore their combination is not expected to be very informative. Absence of J from the feature set yields the most dramatic performance decrease. This is not unexpected, given that J is the best predictor for object relations and that P C and WordNet behave similarly with respect to their interpretation decisions. In general we observe that the combination of smoothing variants outperforms their individual performances (compare Tables 11 and 8). Comparison of Ripper's best performance (80.4%) against the individual smoothing methods reveals a 10.8% accuracy increase over B, J, and Ro, a 4.1% increase over P C , and a 6.2% increase over Wn.</Paragraph>
<Paragraph position="6"> We further analyzed Ripper's performance at predicting object and subject relations. This information is displayed in Table 12, in which we show how performance varies on the full feature set of size n (i.e., five) and each of its subsets of size n-1. As can be seen in Table 12, accuracy at predicting subject relations increases when smoothing variants are combined (compare Tables 12 and 10).
In fact, combination of B, J, Wn, and Ro (or B, J, P C , and Ro) performs best at predicting subject relations, achieving an increase of 24% over P C , the best individual predictor for subject relations (see Table 10). In sum, our results show that combination of the different smoothing variants (using Ripper) achieves better results than each individual method. Our overall performance (i.e., 80.4%) outperforms the default baseline significantly, by 18.9% (chi-square = 17.33, p <.05), and is 9.3% lower than the upper bound established in our agreement study (see Section 5.2). In what follows we first examine the independent contribution of context to the disambiguation performance and then turn to its combination with our five smoothing variants.</Paragraph>
<Paragraph position="8"> We experimented with both the position and the size of the window of tokens (i.e., lemmas or parts of speech) surrounding the nominalization. We varied the window size parameter between one and five words before and after the nominalization target. We use the symbols l and r for left and right context, respectively, subscripts to denote the context encoding (i.e., lemmas or parts of speech), and numbers to express the size of the window surrounding the candidate compound. For example, l l = 5 represents a window of five tokens, encoded as lemmas, to the left of the candidate compound. Tables 13 and 14 show the influence of right and left context, respectively, represented as lemmas. The best performances are achieved with a window of two words to the right or left of the candidate nominalization (see the features r l = 2 and l l = 2 in Tables 13 and 14, respectively). Combination of the best left and right features (r l = 2 and l l = 2) does not increase the disambiguation performance (70.4% +- 1.86% on the training and 66.5% +- 3.41% on the test data). Note that the disambiguation performance simply using contextual features is not considerably worse than the performance of some smoothing variants (see Table 7). Contextual features encoded as lemmas outperform part-of-speech (POS) tags, for which the best performance is achieved with a window of one token to the right or a window of three tokens to the left of the candidate nominalization (see Tables 15 and 16). As in the case of lemmas, combination of the best left and right features (r p = 1 and l p = 3) does not yield better results (66.3% +- 1.94% on the training data and 66.5% +- 3.40% on the test data). The lower performance of POS tags is not entirely unexpected: lemmas capture lexical dependencies that are somewhat lost when a more general level of representation is introduced. For example, Ripper assigns a subject interpretation when for immediately follows a nominalization head (e.g., staff requirement for reconnaissance). This rule cannot be induced when for is represented by its part of speech (e.g., PRP), as there are a number of prepositions that can follow the nominalization head, but only a few of them provide cues for its argument structure.</Paragraph>
<Paragraph position="15"> Table 17 shows the performance of the best contextual features for the task of predicting the individual argument relations. The contextual features are consistently better at predicting object than subject relations.
This is not surprising, given that object relations represent the majority in both the training and test data; furthermore, identifying superficial features that are good predictors for subject relations is a relatively hard task. For example, even though Ripper identifies prepositions (e.g., of, to) following the nominalization head and certain frequent nominalization heads (e.g., behavior) as indicators of subject relations, it has no means of guessing the transitivity of deverbal heads in the absence of syntactic cues. Consider example (25a), in which neither left nor right context is informative with regard to the fact that intervene is intransitive.</Paragraph>
<Paragraph position="18"> Finally, there are some cases in which the syntactic cues can be misleading, as adjacency to the nominalization target does not necessarily indicate argument structure. This is shown in (25b), in which youth is classified as the subject of manager. Although on the surface youth manager at is analogous to nominalizations followed by of (e.g., government promotion of), the prepositional phrase at Wimbledon in (25b) is simply locative and not the argument of manager.</Paragraph>
<Paragraph position="19"> (25) a. If the second reminder produces no result or the reply to either reminder seems to indicate the need for court intervention the matter will be referred to a master or district judge.</Paragraph>
<Paragraph position="20"> b. He was youth manager at Wimbledon when I held a similar position at Palace.</Paragraph>
<Paragraph position="21"> 5.4.3 Combination of Context with Smoothing Variants. In this section we investigate whether the combination of surface contextual features with the predictions of the different smoothing methods has an effect on the disambiguation performance. Although context is good at predicting object relations, it performs poorly at guessing subject relations (see Table 17). We expect the combination of context with smoothing variants (some of which, e.g., Wn, Ro, and P C , perform relatively well at predicting subject relations) to improve performance. Recall that the probabilistic model introduced in Section 2.1 and the interpretation algorithm that makes use of it attempt the interpretation of nominalizations without taking contextual cues into account. Here, we examine how well the different smoothing variants perform in the presence of contextual information. Table 18 shows Ripper's performance when the best context (i.e., r l = 2) is combined with a single smoothing method and with all five variants. For the smoothing variants, we used the version of the interpretation algorithm that takes suffix semantics into account (see Table 8).</Paragraph>
<Paragraph position="22"> Comparison between Tables 8 and 18 reveals that the inclusion of context generally increases performance. Combination of B with the best context yields a 6.7% increase over B; an increase of 8.8% (over J) and 7.7% (over Ro) is observed when J and Ro are combined with context, respectively. No increase in performance is observed when context is combined with P C (see Table 18), whereas combination of Wn with context yields an 11.9% increase over Wn alone. Combining all five smoothing variants with context yields an increase of 4.7% over just the combination of B, J, P C , Ro, and Wn (see Table 12).
Our best performance (i.e., 86.1%) is achieved when Wn is combined with right context (r l = 2); this performance is significantly better than the simple strategy of always defaulting to an object classification, which yields an accuracy of 61.5% (chi-square = 30.64, p <.05), and only 3.6% lower than the upper bound of 89.7%. As shown in Table 19, the inclusion of context increases accuracy when it comes to the prediction of subject relations (with the exception of P C , which is relatively good at predicting subject relations, and therefore in that case the inclusion of context does not add much useful information). The combination of Wn with r l = 2 achieves the highest accuracy (87.3%) at predicting subject relations.</Paragraph> </Section> </Section>
<Section position="7" start_page="379" end_page="383" type="metho"> <SectionTitle> 6. Discussion </SectionTitle>
<Paragraph position="0"> We have described an empirical approach for the automatic interpretation of nominalizations. We cast the interpretation task as a disambiguation problem and proposed a statistical model for inferring the argument relations holding between a deverbal head and its modifier. Our experiments revealed that the interpretation task suffers from data sparseness: even an approximation that maps the nominalized head to its underlying verb does not provide sufficient evidence for quantifying the argument relation of a modifier noun and its nominalized head.</Paragraph>
<Paragraph position="1"> We showed how the argument relations (which are not readily available in the corpus) can be retrieved by using partial parsing and smoothing techniques that exploit distributional and taxonomic information. We compensated for the lack of sufficient distributional information using either methods that directly recreate the frequencies of word combinations or contextual features whose distribution in the corpus indirectly provides information about nominalizations. We compared and contrasted a variety of smoothing approaches proposed in the literature and demonstrated that their combination yields satisfactory results for the demanding task of semantic disambiguation. We also explored the contribution of context and showed that it is useful for the disambiguation task. Our approach is applicable to domain-independent unrestricted text and does not require the hand coding of semantic information. In the following sections we discuss our results and their potential usefulness for NLP applications. We also address the limitations of our approach and sketch potential extensions.</Paragraph>
<Section position="1" start_page="380" end_page="380" type="sub_section"> <SectionTitle> 6.1 The Interpretation of Nominalizations </SectionTitle>
<Paragraph position="0"> Our results indicate that a simple probabilistic model that uses smoothed counts (see the interpretation algorithm in Section 4) yields a significant increase over the baseline without taking context into account. Distance-weighted smoothing using P C and class-based smoothing using WordNet achieve the best results (76.3% and 74.2%, respectively). Combination of different smoothing methods (using Ripper) yields an overall performance of 80.4%, again without taking context into consideration. Context alone achieves a disambiguation performance of 68.6%, approximating the performance of some of the smoothing variants (see Tables 9 and 13).
This result suggests that simple features that can be easily retrieved and estimated from the corpus contain enough information to capture generalizations about the behavior of nominalizations. As expected, the combination of smoothed probabilities with context outperforms the accuracy of individual smoothing variants. The combination of WordNet with a right context of size two achieves an accuracy of 86.1%, compared to an upper bound for the task (i.e., intersubject agreement) of 89.7%. This is an important result considering the simplifications in the system and the sparse data problems encountered in the estimation of the model probabilities. The second-best performance is achieved when J is combined with context (78.4%; see Table 18). This result shows that information inherent in the corpus can make up for the lack of distributional evidence and furthermore that it is possible to extract semantic information from corpora (even if they are not semantically annotated in any way) without recourse to pre-existing taxonomies such as WordNet.</Paragraph> </Section> <Section position="2" start_page="380" end_page="383" type="sub_section"> <SectionTitle> 6.2 Limitations and Extensions </SectionTitle> <Paragraph position="0"> To a certain extent the difficulty of interpreting nominalizations is due to their context dependence. Although the approach presented in the previous sections takes immediate context into account, it does so in a shallow manner, without having access to the meaning of the words surrounding the nominalization target, their syntactic dependencies, or the general discourse context within which the compound is embedded.</Paragraph> <Paragraph position="1"> Consider example (26a), in which the compound computer guidance receives a subject interpretation (e.g., the computer guides the chef). Our approach cannot detect that the computer here is ascribed animate qualities and opts for the most likely interpretation (i.e., an object analysis). In some cases the modifier stands in a metonymic relation to its head. Consider the examples in sentences (26b, 26c), in which the nominalizations industry reception and market acceptance can be thought of as instances of the metonymic schema &quot;whole for part&quot; (Lakoff and Johnson 1980). In example (26b) it is the industry as a whole that receives the guests rather than lasmo, which is one of its parts, Lapata The Disambiguation of Nominalizations whereas in (26c) the modifier market in market acceptance refers to the opinion leaders, who are part of the market.</Paragraph> <Paragraph position="2"> (26) a. Of course, none of this means that the equipment is taking anything away from the chef's own individual skills which are irreplaceable.</Paragraph> <Paragraph position="3"> What it does ensure is that the chef has complete control over some of the most vital tools of his trade, with computer guidance as an important aid.</Paragraph> <Paragraph position="4"> b. The final evening saw more than 300 guests attend an industry reception, hosted by lasmo.</Paragraph> <Paragraph position="5"> c. Marketers interested in the development and introduction of new products will be particularly interested in the attitude of opinion leaders to these products, for their general market acceptance can be slowed down or speeded up by the views of such people.</Paragraph> <Paragraph position="6"> Consider now sentence (27a). The nominalization student briefing is ambiguous, even though it is presented within its immediate context. 
Taking more context into account (see (27a)) does not provide enough disambiguation information either, although perhaps it introduces a slight bias in favor of an object interpretation (i.e., someone is briefing the students). For this particular example, we would have to know what the document within which student briefing occurs is about (i.e., a list of teaching guidelines for university lecturers). The sentences in (27) are taken from a document section entitled &quot;Work Experience&quot; that emphasizes the importance of work experience for students. Given all this background information, it becomes apparent that it is not the students who are doing the briefing in (27b).</Paragraph>
<Paragraph position="7"> (27) a. Explain to both students and organisations the role of work experience in personal development and its part in the planned programme.</Paragraph>
<Paragraph position="8"> b. Provide comprehensive guidelines on the work experience which includes a student briefing, an employer briefing and a student work checklist.</Paragraph>
<Paragraph position="9"> The observation that discourse or pragmatic context may influence interpretations is by no means new or particular to nominalizations. Sparck Jones (1983) observes that a variety of factors can potentially influence the interpretation of compound nouns in general. These factors range from syntactic analysis (e.g., to arrive at an interpretation of the compound onion tears, it is necessary to identify that tears is a noun and not the third-person singular of the verb tear) to semantic information (e.g., for interpreting onion tears, it is important to know that onions cannot be tears or that tears are not made of onions) and pragmatic information. Pragmatic inference may be called for in cases in which syntactic or semantic information is straightforwardly supplied, even where the local text context provides rich information bearing on the interpretation of the compound. Copestake and Lascarides (1997) and Lascarides and Copestake (1998) make the same observation for a variety of constructions such as compound nouns, adjective-noun combinations, and verb-argument relations. Consider the sentences in (28)-(30). The discourse in (28) favors the interpretation &quot;bag for cotton clothes&quot; for cotton bag over the more likely interpretation &quot;bag made of cotton.&quot; Although fast programmer is typically a programmer who programs fast, when the adjective-noun combination is embedded in a context like (29a, 29b), the less likely meaning &quot;a programmer who runs fast&quot; is triggered. Finally, although one is more likely to enjoy reading a book than eating it, the context in (30) triggers the latter interpretation. (28) a. Mary sorted her clothes into various bags made from plastic.</Paragraph>
<Paragraph position="10"> b. She put her skirt into the cotton bag.</Paragraph>
<Paragraph position="11"> (29) a. All the office personnel took part in the company sports day last week. b. One of the programmers was a good athlete, but the other was struggling to finish the courses.</Paragraph>
<Paragraph position="12"> c. The fast programmer came first in the 100m.</Paragraph>
<Paragraph position="13"> (30) a. My goat eats anything.</Paragraph>
<Paragraph position="14"> b. He really enjoyed your book.</Paragraph>
<Paragraph position="15"> Pragmatic context may be particularly important for the interpretation of compound nouns.
Because compounds can be used as a text compression device (Marsh 1984), it is plausible that pragmatic inference is required to supply the compound's interpretation. This observation is somewhat supported by our interannotator agreement experiment (see Section 5.2). Even though our participants were provided with some context, the agreement among them was not complete (they reached a K of .78, when absolute agreement is 1). Although our approach takes explicit contextual information into account, it is agnostic to discourse or pragmatic information. Encoding pragmatic information would involve considerable manual effort. Furthermore, a hypothetical statistical learner that takes pragmatic information into account would not only have to deal with data sparseness but also detect cases in which conflicts arise between discourse information and the likelihood of a given interpretation.</Paragraph>
<Paragraph position="16"> Our experiments focused on nominalizations derived from verbs specifically subcategorizing for direct objects. Although nominalizations whose verbs take prepositional frames (e.g., oil painting, soccer competition) represent a small fraction of the nominalizations found in the corpus (9.2%), a more general approach would have to take those verbs into account. This task is harder than interpreting direct objects, since to estimate the frequency f(verb, prep, noun), one needs first to determine with some degree of accuracy the attachment site of the prepositional phrase. Taking into account prepositional phrases and their attachment sites can also be useful for the interpretation of compounds other than nominalizations. Consider the compound noun pet spray from (1). Assuming that pet spray can be &quot;spray for pets,&quot; &quot;spray in pets,&quot; &quot;spray about pets,&quot; or &quot;spray from pets,&quot; we can derive the most likely interpretation by looking at which types of prepositional phrases (e.g., for pets, about pets) are most likely to attach to spray. Note that in cases in which the expressions spray for pets and spray in pets are not attested in the corpus, their respective co-occurrence frequencies can be re-created using the techniques presented in Section 3.</Paragraph>
<Paragraph position="17"> Finally, the approach advocated here can be straightforwardly extended to nominalizations with adjectival modifiers (e.g., parental refusal; see the examples in (2)). In most cases the adjective in question is derived from a noun, and any inference process on the argument relations between the head noun and the adjectival modifier could take advantage of this information.</Paragraph> </Section>
<Section position="3" start_page="383" end_page="383" type="sub_section"> <SectionTitle> 6.3 Relevance for NLP Applications </SectionTitle>
<Paragraph position="0"> Robust semantic ambiguity resolution is challenging for current NLP systems. Although general-purpose taxonomies like WordNet or Roget's thesaurus are useful for certain interpretation tasks, such resources are not exhaustive or generally available for languages other than English. Furthermore, the compound noun interpretation task involves acquiring semantic information that is linguistically implicit and therefore not directly available in corpora or taxonomic resources.
Indeed, interpreting compound nouns is often analyzed in the linguistics literature in terms of (impractical) general-purpose reasoning with pragmatic information such as real-world knowledge (e.g., Hobbs et al. 1993; see Section 7 for details). We show that it is feasible to learn implicit semantic information automatically from the corpus by utilizing linguistically principled approximations, surface syntactic cues, and (when available) taxonomic information.</Paragraph> <Paragraph position="1"> The interpretation of compound nouns is important for several NLP tasks, notably machine translation. Consider the nominalization satellite observation (taken from (4a)), which may mean &quot;observation by satellite&quot; or &quot;observation of satellites.&quot; To translate satellite observation into Spanish, we have to work out whether satellite is the subject or object of the verb observe. In the first case satellite observation translates as observaci'on por satelite (observation by satellite), whereas in the latter it translates as observaci'on de satelites (observation of satellites). In this case the implicit linguistic information has to be retrieved and disambiguated to obtain a meaningful translation. Information retrieval is another relevant application in which again the underlying meaning must be rendered explicit. Consider a search engine faced with the query cancer treatment.</Paragraph> <Paragraph position="2"> Presumably one would not like to obtain information about cancer or treatment in general, but about methods or medicines that help treat cancer. So knowledge about the fact that cancer is the object of treatment could help rank relevant documents (i.e., documents in which cancer is the object of the verb treat) before nonrelevant ones or restrict the number of retrieved documents.</Paragraph> </Section> </Section> <Section position="8" start_page="383" end_page="385" type="metho"> <SectionTitle> 7. Related Work </SectionTitle> <Paragraph position="0"> In this section we review previous work on the interpretation of compound nouns and compare it to our own work. Despite the differences among them, most approaches require large amounts of hand-crafted knowledge, place emphasis on the recovery of relations other than nominalizations (see the examples in (1)), contain no quantitative evaluation (the exceptions are Leonard (1984), Vanderwende (1994), and Lauer (1995)), and generally assume that context dependence is either negligible or of little impact.</Paragraph> <Paragraph position="1"> Most symbolic approaches are limited to a specific domain because of the large effort involved in hand-coding semantic information and are distinguished in two main types: concept-based and rule-based.</Paragraph> <Paragraph position="2"> Under the concept-based approach, each noun in the compound is associated with a concept and various slots. Compound interpretation reduces to slot filling, that is, evaluating how appropriate concepts are as fillers of particular slots. A scoring system evaluates each possible interpretation and selects the highest-scoring analysis.</Paragraph> <Paragraph position="3"> Examples of the approach are Finin (1980) and McDonald (1982). 
As no qualitative evaluation is reported in these studies, it is difficult to assess how their methods perform, although it is clear that considerable effort needs to be invested in the encoding of the appropriate semantic knowledge.</Paragraph> <Paragraph position="4"> Under the rule-based approach, interpretation is performed by sequential rule application. A fixed set of rules is applied in a fixed order, and the first rule that is semantically compatible with the nouns forming the compound results in the most Computational Linguistics Volume 28, Number 3 plausible interpretation. The approach was introduced by Leonard (1984), was based on a hand-crafted lexicon, and achieved an accuracy of 76.0% (on the training set). Vanderwende (1994) further developed a rule-based algorithm that does not rely on a hand-crafted lexicon but extracts the required semantic information from an on-line dictionary instead. The system achieved an accuracy of 52.0%.</Paragraph> <Paragraph position="5"> A variant of the concept-based approach uses unification to constrain the semantic relations between nouns represented as feature structures. Jones (1995) used a typed graph-based unification formalism and default inheritance to specify features for nouns whose combination results in different interpretations. Again no evaluation is reported, although Jones points out that ambiguity can be a problem, as all possible interpretations are produced for a given compound. Wu (1993) provides a statistical framework for the unification-based approach and develops an algorithm for approximating the probabilities of different possible interpretations using the maximum-entropy principle. No evaluation of the algorithm's performance is given. The approach remains knowledge intensive, however, as it requires manual construction of the feature structures.</Paragraph> <Paragraph position="6"> Lauer (1995) provides a probabilistic model of compound noun paraphrasing (e.g., state laws are &quot;the laws of the state,&quot; war story is &quot;a story about war&quot;) that assigns probabilities to different paraphrases using a corpus in conjunction with Roget's thesaurus. Lauer does not address the interpretation of nominalizations or compounds with hyponymic relations (see example (1e)) and takes into account only prepositional paraphrases of compounds (e.g., of, for, in, at, etc.). Lauer's model makes predictions about the meaning of compound nouns on the basis of observations about prepositional phrases. The model combines the probability of the modifier given a certain preposition with the probability of the head given the same preposition and assumes that these two probabilities are independent.</Paragraph> <Paragraph position="7"> Consider, for instance, the compound war story. To derive the intended interpretation (i.e., &quot;story about war&quot;), the model takes into account the frequency of story about and about war. For the modifier and head noun are substituted the concepts with which they are represented in Roget's thesaurus, and the frequency of a concept and a preposition is calculated accordingly (see Section 3.2). Lauer's (1995) model achieves an accuracy of 47.0%. The result is difficult to interpret, given that no experiments with humans are performed and therefore the optimal performance on the task is unknown. 
Lauer acknowledges that data sparseness can be a problem for the estimation of the model parameters and also that the assumption of independence between the head and its modifier is unrealistic and leads to errors in some cases.</Paragraph> <Paragraph position="8"> Although it is generally acknowledged that context, both intra- and intersentential, may influence the interpretation task, contextual factors are typically ignored, with the exception of Hobbs et al. (1993), who propose that the interpretation of a compound can be achieved via abductive inference. To interpret a compound one must prove the logical form of its constituent parts from what is mutually known. The amount of world knowledge required to work out what is mutually known, however, renders such an approach infeasible in practice. Furthermore, Hobbs et al.'s approach does not capture linguistic constraints on compound noun formation and as a result cannot predict that a noun-noun sequence like cancer lung (under the interpretation &quot;cancer in the lung&quot;) is odd.</Paragraph> <Paragraph position="9"> Unlike previous work, we did not attempt to recover the semantic relations holding between a head and its modifier (see (1)). Instead, we focused on the less ambitious task of interpreting nominalizations, that is, compounds whose heads are derived from a verb and whose modifiers are interpreted as its arguments. Similarly to Lauer (1995), we have proposed a simple probabilistic model that uses information about Lapata The Disambiguation of Nominalizations the distributional properties of words and domain-independent symbolic knowledge (i.e., WordNet, Roget's thesaurus). Unlike Lauer, we have addressed the sparse-data problem by directly comparing and contrasting a variety of smoothing approaches proposed in the literature and have shown that these methods yield satisfactory results for the demanding task of semantic disambiguation. Furthermore, we have shown that the combination of different sources of taxonomic and nontaxonomic information (using Ripper) is effective for tasks facing data sparseness. In contrast to previous approaches, we explored the effect of context on the interpretation task and showed that its inclusion generally improves disambiguation performance. We combined different information sources (e.g., contextual features and smoothing variants) using Ripper. Although the use of classifiers has been widespread in studies concerning discourse segmentation (Passonneau and Litman 1997), the disambiguation of discourse cues (Siegel and McKeown 1994), the acquisition of lexical semantic classes (Merlo and Stevenson 1999; Siegel 1999), the automatic identification of user corrections in spoken dialogue systems (Hirschberg, Litman, and Swerts 2001), and word sense disambiguation (Pedersen 2001), the treatment of the interpretation of compound nouns as a classification task is, to our knowledge, novel.</Paragraph> <Paragraph position="10"> Our approach can be easily adapted to account for Lauer's (1995) paraphrasing task. Instead of assuming that the probability of the compound modifier given a preposition is independent from the probability of the compound head given the same preposition, a more straightforward model would take into account the joint probability of the head, the preposition, and the modifier. 
In cases in which a certain head, preposition, and modifier combination is not attested in the corpus (e.g., story about war), the methodology put forward in Experiments 2 and 3 could be used to re-create its frequency (see also the discussion in Section 6).</Paragraph>
<Paragraph position="11"> Unlike previous approaches, we provide an upper bound for the task. Recall from Section 5.2 that an experiment with humans was performed to evaluate whether the task can be performed reliably. In doing so we took context into account, and as a result we established a higher upper bound for the task than would have been the case had context not been taken into account. Furthermore, it is not clear whether subjects could arrive at consistent interpretations for nominalizations out of context. Downing's (1977) experiments show that, when asked to interpret compounds out of context, participants tend to come up with a variety of interpretations that are not always compatible. For example, for the compound bullet hole, the interpretations &quot;a hole made by a bullet,&quot; &quot;a hole shaped like a bullet,&quot; &quot;a fast-moving hole,&quot; &quot;a hole in which to hide bullets,&quot; and &quot;a hole into which to throw (bullet) casings&quot; were provided.</Paragraph> </Section> </Paper>