<?xml version="1.0" standalone="yes"?> <Paper uid="E95-1016"> <Title>On Learning more Appropriate Selectional Restrictions</Title> <Section position="3" start_page="113" end_page="114" type="metho"> <SectionTitle> 3 Variations on the association </SectionTitle> <Paragraph position="0"> statistical measure In this section we consider different variations on the association score in order to make it more robust. The different techniques are experimentally evaluated in section 4.2.</Paragraph> <Section position="1" start_page="113" end_page="113" type="sub_section"> <SectionTitle> 3.1 Variations on the prior probability </SectionTitle> <Paragraph position="0"> When considering the prior probability, the more independent of the context it is the better to measure actual associations. A sensible modification of the measure would be to consider p(c) as the prior distribution:</Paragraph> <Paragraph position="2"> Using the chain rule on mutual information (Cover and Thomas, 1991, p. 22) we can mathematically relate the different versions of Assoc, mssoc'(v, s, c) = p(clv, s)log ~+Assoc(v, s, c) The first advantage of Assoc' would come from this (information theoretical) relationship. Specifically, the AssoF takes into account the preference (selection) of syntactic positions for particular classes. In intuitive terms, typical subjects (e.g. <person, individual, ...>) would be preferred (to atypical subjects as <suit_of_clothes>) as SRs on the subject in contrast to Assoc. The second advantage is that as long as the prior probabilities, p(c), involve simpler events than those used in Assoc, p(cls), the estimation is easier and more accurate (ameliorating data sparseness).</Paragraph> <Paragraph position="3"> A subsequent modification would be to estimate the prior, p(c), from the counts of all the nouns appearing in the corpus independently of their syntactic positions (not restricted to be heads of verbal complements). In this way, the estimation of p(c) would be easier and more accurate.</Paragraph> </Section> <Section position="2" start_page="113" end_page="113" type="sub_section"> <SectionTitle> 3.2 Estimating class probabilities from </SectionTitle> <Paragraph position="0"> noun frequencies In the global weighting technique presented in equation 2 very polysemous nouns provide the same amount of evidence to every sense as non-ambiguous nouns do -while less ambiguous nouns could be more informative about the correct classes as long as they do not carry ambiguity. The weight introduced in (1) could alternatively be found in a local manner, in such a way that more polysemous nouns would give less evidence to each one of their senses than less ambiguous ones. Local weight could be obtained using p(cJn). Nevertheless, a good estimation of this probability seems quite problematic because of the lack of tagged training material. In absence of a better estimator we use a rather poor one as the uniform distribution,</Paragraph> <Paragraph position="2"> Resnik (1993) also uses a local normalization technique but he normalizes by the total number of classes in the hierarchy. This scheme seems to present two problematic features (see (Ribas, 1994b) for more details). First, it doesn't take dependency relationships introduced by hyperonymy into account. 
<Section position="3" start_page="113" end_page="114" type="sub_section"> <SectionTitle> 3.3 Other statistical measures to score SRs </SectionTitle> <Paragraph position="0"> In this section we propose the application of measures other than Assoc for learning SRs: the log-likelihood ratio (Dunning, 1993), relative entropy (Cover and Thomas, 1991), the mutual information ratio (Church and Hanks, 1990), and φ² (Gale and Church, 1991). Their experimental evaluation is presented in section 4.</Paragraph> <Paragraph position="1"> The statistical measures used to detect associations on the distribution defined by two random variables X and Y work by measuring the deviation of the conditional distribution, P(X|Y), from the distribution expected if both variables were independent, i.e., the marginal distribution, P(X). If P(X) is a good approximation of P(X|Y), association measures should be low (near zero); otherwise they deviate significantly from zero.</Paragraph> <Paragraph position="2"> Table 2 shows the cross-table formed by the conditional and marginal distributions in the case of v_s and c. The different association measures use the information provided in the cross-table to different extents. Thus, Assoc and the mutual information ratio consider only the deviation of the conditional probability p(c|v,s) from the corresponding marginal, p(c).</Paragraph> <Paragraph position="3"> On the other hand, the log-likelihood ratio and φ² measure the association between v_s and c considering the deviation of the four conditional cells in table 2 from the corresponding marginals. It is plausible that the deviation of the cells not taken into account by Assoc can help in extracting useful SRs.</Paragraph> <Paragraph position="4"> Finally, it would be interesting to use only the information related to the selectional behavior of v_s, i.e., comparing the conditional probabilities of c and ¬c given v_s with the corresponding marginals. Relative entropy, D(P(X|v_s) || P(X)), could do this job.</Paragraph> </Section> </Section>
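As a rough illustration of how these measures exploit the cross-table to different extents, the sketch below computes the mutual information ratio, the log-likelihood ratio, φ² and the relative entropy D(P(X|v_s) || P(X)) from the four cell frequencies. The frequencies are invented, base-2 logarithms are used throughout (only the scale differs from Dunning's natural-log formulation), and zero cells are handled only by a guard in the 0·log 0 terms.

```python
from math import log2

def association_measures(n11, n10, n01, n00):
    """Measures over the 2x2 cross-table of v_s vs. class c:
       n11 = freq(v_s, c)    n10 = freq(v_s, ~c)
       n01 = freq(~v_s, c)   n00 = freq(~v_s, ~c)"""
    n = n11 + n10 + n01 + n00
    p_c, p_vs = (n11 + n01) / n, (n11 + n10) / n

    def term(o, e):
        # o * log2(o / e), with the usual 0 * log 0 = 0 convention.
        return o * log2(o / e) if o > 0 else 0.0

    # Mutual information ratio: deviation of p(c|v_s) from p(c) only.
    mi = log2((n11 / n) / (p_vs * p_c))

    # Log-likelihood ratio: deviation of all four cells from the marginals.
    rows, cols = (n11 + n10, n01 + n00), (n11 + n01, n10 + n00)
    obs = ((n11, n10), (n01, n00))
    g2 = 2 * sum(term(obs[i][j], rows[i] * cols[j] / n)
                 for i in (0, 1) for j in (0, 1))

    # phi^2: chi-square normalized by the sample size.
    phi2 = (n11 * n00 - n10 * n01) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])

    # Relative entropy D(P(X|v_s) || P(X)), X ranging over {c, ~c}.
    p1 = n11 / rows[0]
    d = term(p1, p_c) + term(1 - p1, 1 - p_c)

    return mi, g2, phi2, d

print(association_measures(30, 70, 120, 780))
```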
<Section position="4" start_page="114" end_page="116" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="114" end_page="115" type="sub_section"> <SectionTitle> 4.1 Evaluation methods of SRs </SectionTitle> <Paragraph position="0"> Evaluation in NLP has been crucial to fostering research in particular areas. Evaluation of the SR learning task would provide grounds to compare different techniques that try to abstract SRs from corpora using WordNet (e.g., section 4.2). It would also permit measuring the utility of the SRs obtained using WordNet in comparison with frameworks using other kinds of knowledge. Finally, it would be a powerful tool for detecting flaws of a particular technique (e.g., the analysis in (Ribas, 1994a)).</Paragraph> <Paragraph position="1"> However, a related and crucial issue is which linguistic tasks are used as a reference. SRs are useful for both lexicography and NLP. On the one hand, from the point of view of lexicography, the goal of evaluation would be to measure the quality of the induced SRs (i.e., how well the resulting classes correspond to the nouns as they were used in the corpus). On the other hand, from the point of view of NLP, SRs should be evaluated on their utility (i.e., how much they help in performing the reference task).</Paragraph> <Paragraph position="2"> As far as lexicography (quality) is concerned, we think the main criteria SRs acquired from corpora should meet are: (a) correct categorization (inferred classes should correspond to the correct senses of the words being generalized), (b) an appropriate generalization level, and (c) good coverage (the majority of the noun occurrences in the corpus should be successfully generalized by the induced SRs).</Paragraph> <Paragraph position="3"> Some of the methods we could use for assessing experimentally the fulfillment of these criteria are: * Introspection: A lexicographer checks whether the SRs fulfill criteria (a) and (b) above (e.g., the manual diagnosis in table 1). Besides the intrinsic difficulties of this approach, it does not seem appropriate for comparing across different SR learning techniques, because of its qualitative flavor.</Paragraph> <Paragraph position="4"> * Quantification of generalization-level appropriateness: A possible measure would be the percentage of sense occurrences included in the induced SRs that are effectively correct (from now on called the Abstraction Ratio). Hopefully, a technique with a higher abstraction ratio learns classes that fit the set of examples better. A manual assessment of the ratio confirmed this behavior, as testing sets with a lower ratio seemed to induce fewer ↑Abs cases.</Paragraph> <Paragraph position="5"> * Quantification of coverage: It could be measured as the proportion of triples whose correct sense belongs to one of the SRs.</Paragraph> <Paragraph position="6"> The NLP tasks in which the utility of SRs could be evaluated are diverse. Some of them have already been introduced in section 1. In the recent literature ((Grishman and Sterling, 1992), (Resnik, 1993), ...) several task-oriented schemes to test selectional restrictions (mainly on syntactic ambiguity resolution) have been proposed. However, we have tested SRs on a word sense selection (WSS) task, using the following scheme. For every triple in the testing set, the algorithm selects as most appropriate the noun sense that has as a hyperonym the SR class with the highest association score. When more than one sense belongs to the highest-scoring SR, a random selection is performed. When no SR has been acquired, the algorithm remains undecided. The results of this WSS procedure are checked against a manually analyzed testing sample, and precision and recall ratios are calculated. Precision is the ratio of manual-automatic matches to the number of noun occurrences disambiguated by the procedure. Recall is the ratio of manual-automatic matches to the total number of noun occurrences.</Paragraph> </Section>
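The WSS scheme just described can be summarized in a few lines of Python. The SRs, the sense inventory and the gold-standard analyses below are invented, and for brevity each sense stands directly for its hyperonym class rather than walking up the WordNet hierarchy.

```python
import random

# Invented acquired SRs: (verb, position) -> {SR class: association score}.
SRS = {("file", "obj"): {"lawsuit": 0.25, "accusation": 0.10}}

# Invented sense inventory and manually analyzed testing sample.
SENSES = {"suit": ["suit_of_clothes", "lawsuit"], "charge": ["accusation", "fee"]}
GOLD = [(("file", "obj", "suit"), "lawsuit"), (("file", "obj", "charge"), "accusation")]

def select_sense(v, s, n):
    """Pick the sense covered by the highest-scoring SR class;
    break ties randomly; remain undecided when no SR applies."""
    scores = SRS.get((v, s), {})
    covered = [(scores[c], c) for c in SENSES[n] if c in scores]
    if not covered:
        return None                       # undecided
    best = max(score for score, _ in covered)
    return random.choice([c for score, c in covered if score == best])

matches = decided = 0
for (v, s, n), gold in GOLD:
    answer = select_sense(v, s, n)
    if answer is not None:
        decided += 1
        matches += (answer == gold)

precision = matches / decided             # matches / occurrences disambiguated
recall = matches / len(GOLD)              # matches / total occurrences
print(f"precision={precision:.2f} recall={recall:.2f}")
```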
<Section position="2" start_page="115" end_page="116" type="sub_section"> <SectionTitle> 4.2 Experimental results </SectionTitle> <Paragraph position="0"> In order to evaluate the different variants of the association score and the impact of thresholding, we performed several experiments. In this section we analyze the results. As training set we used the 870,000 words of WSJ material provided in the ACL/DCI version of the Penn Treebank. The testing set consisted of 2,658 triples corresponding to four verbs of average frequency in the Treebank: rise, report, seek and present. We only considered those triples that had been correctly extracted from the Treebank and whose noun had the correct sense included in WordNet (2,165 of the 2,658 triples; from now on called the testing sample).</Paragraph> <Paragraph position="1"> As evaluation measures we used coverage, the abstraction ratio, and the recall and precision ratios on the WSS task (section 4.1). In addition, we performed some manual evaluation, comparing the SRs acquired by the different techniques.</Paragraph> <Paragraph position="2"> Coverage for the different techniques is shown in table 3. The higher the coverage, the better the technique succeeds in correctly generalizing more of the input examples. The labels used to refer to the different techniques are as follows: "Assoc & p(c|s)" corresponds to the basic association measure (section 2), "Assoc & Head nouns" and "Assoc & All nouns" to the techniques introduced in section 3.1, "Assoc & Normalizing" to the local normalization (section 3.2), and finally, log-likelihood, D (relative entropy) and I (mutual information ratio) to the techniques discussed in section 3.3.</Paragraph> <Paragraph position="3"> The abstraction ratio for the different techniques is shown in table 4. In principle, the higher the abstraction ratio, the better the technique succeeds in filtering out incorrect senses (fewer ↑Abs cases). The precision and recall ratios on the noun WSS task for the different techniques are shown in table 5. In principle, the higher the precision and recall ratios, the better the technique succeeds in inducing appropriate SRs for the disambiguation task.</Paragraph> <Paragraph position="4"> Since the evaluation measures account for different phenomena, the goodness of a particular technique should be quantified as a trade-off among them. Most of the results are very similar (the differences are not statistically significant), so we should be cautious when extrapolating from them. Some of the conclusions from the tables above are: 1. φ² and I obtain noticeably worse results than the other measures (although their abstraction ratio is quite good).</Paragraph> <Paragraph position="5"> 2. The local normalizing technique using the uniform distribution does not help. It seems that the local weighting misinforms the algorithm. The problem is the reduced weight that polysemous nouns get, even though they seem to be the most informative. However, a better-informed kind of local weight (section 5) should improve the technique significantly.</Paragraph> <Paragraph position="6"> 3. All versions of Assoc (except the local normalization) get good results, especially the two techniques that exploit a simpler prior distribution, which seem to improve on the basic technique.</Paragraph> <Paragraph position="7"> 4. The log-likelihood ratio and D seem to get slightly worse results than the Assoc techniques, although the results are very similar.</Paragraph> <Paragraph position="8"> We were also interested in measuring the impact of thresholding on the acquired SRs. Figure 1 shows the different evaluation measures as a function of the threshold. As might be expected, as the threshold increases (i.e., some cases are not classified) the two ratios slightly diverge (precision increases and recall diminishes). Figure 1 also shows the impact of thresholding on the coverage and abstraction ratios. Both decrease as the threshold increases, probably because when the rejecting threshold is low, small classes that fit the data well can be induced, whereas otherwise overgeneral or incomplete SRs are learned.</Paragraph>
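A sketch of the rejection threshold under discussion, with invented scores: pruning SR classes below the threshold typically raises precision (only confident SRs survive) at the cost of recall and coverage (more occurrences are left undecided or ungeneralized).

```python
# Invented SRs with association scores, as in the previous sketches.
srs = {("file", "obj"): {"lawsuit": 0.25, "accusation": 0.10, "fee": 0.02}}

def apply_threshold(srs, threshold):
    """Keep only the SR classes whose association score reaches the threshold."""
    return {vs: {c: sc for c, sc in classes.items() if sc >= threshold}
            for vs, classes in srs.items()}

for t in (0.0, 0.05, 0.15, 0.30):
    # Re-running the WSS evaluation on apply_threshold(srs, t) would trace
    # the precision/recall divergence shown in figure 1.
    print(t, apply_threshold(srs, t))
```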
<Paragraph position="9"> Finally, it seems that the precision and abstraction ratios are inversely correlated (as precision grows, abstraction decreases). In terms of WSS, general classes may perform better than classes that fit the data more closely. Nevertheless, this relationship should be further explored in future work.</Paragraph> </Section> </Section> <Section position="5" start_page="116" end_page="116" type="metho"> <SectionTitle> 5 Conclusions and future work </SectionTitle> <Paragraph position="0"> In this paper we have presented some variations, affecting the association measure and thresholding, on the basic technique for learning SRs from on-line corpora. We proposed some evaluation measures for the SR learning task. Finally, experimental results on these variations were reported.</Paragraph> <Paragraph position="1"> We can conclude that some of these variations seem to improve the results obtained with the basic technique. However, although the technique still seems far from practical application to NLP tasks, it may be most useful for providing experimental insight to lexicographers. Future lines of research will mainly concentrate on improving the local normalization technique by resolving noun sense ambiguity. We foresee the application of the following techniques: * Simple techniques to decide the best sense c given the target noun n, using estimates of the n-grams P(c), P(c|n), P(c|v,s) and P(c|v,s,n), obtained from supervised and unsupervised corpora.</Paragraph> <Paragraph position="2"> * Combining the different n-grams by means of smoothing techniques.</Paragraph> <Paragraph position="3"> * Calculating P(c|v,s,n) by combining P(n|c) and P(c|v,s), and applying the EM algorithm (Dempster et al., 1977) to improve the model.</Paragraph> <Paragraph position="4"> * Using the WordNet hierarchy as a source of backing-off knowledge, in such a way that if the n-grams involving c are insufficient to decide the best sense (i.e., are equal to zero), the trigrams of ancestor classes could be used instead.</Paragraph> </Section> </Paper>