<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0808">
  <Title>Word Sense Disambiguation for Acquisition of Selectional Preferences</Title>
  <Section position="4" start_page="52" end_page="52" type="metho">
    <SectionTitle>
1. The Model Description Length - the number of
</SectionTitle>
    <Paragraph position="0"> bits to encode the model</Paragraph>
  </Section>
  <Section position="5" start_page="52" end_page="53" type="metho">
    <SectionTitle>
2. The Data Description Length - the number of
</SectionTitle>
    <Paragraph position="0"> bits to encode the data in the model.</Paragraph>
    <Paragraph position="1"> In this way, rather than searching for the classes with the highest association score, MDL searches for the classes which make the best compromise between explaining the data well by having a high association score and providing as simple (general) a model as possible and so minimising the model description length.</Paragraph>
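The compromise can be sketched numerically. The toy code below is not Li and Abe's exact formulation: the (k/2) log2 N parameter cost, the uniform sharing of a class's probability mass among the nouns it dominates, and the candidate cuts are all illustrative assumptions.

```python
import math

def model_dl(cut, sample_size):
    # Model description length: (k/2) * log2(N) bits for the k class
    # probabilities on the cut (a standard MDL parameter cost).
    return (len(cut) / 2) * math.log2(sample_size)

def data_dl(cut):
    # Data description length: each class's probability mass is assumed to
    # be shared uniformly among the nouns it dominates (its "size"), so
    # overly broad classes pay a price for explaining skewed data.
    total = sum(count for count, _size in cut)
    return -sum(count * math.log2((count / total) / size)
                for count, size in cut if count > 0)

def best_cut(candidates):
    # MDL chooses the cut minimising model DL + data DL.
    def total_dl(cut):
        return model_dl(cut, sum(count for count, _ in cut)) + data_dl(cut)
    return min(candidates, key=total_dl)

# One broad class dominating 30 nouns vs three specific classes of 10 nouns:
# the skewed counts justify the extra parameters of the specific cut.
general = [(100, 30)]
specific = [(70, 10), (20, 10), (10, 10)]
assert best_cut([general, specific]) == specific
```

Had the counts been spread evenly, the cheaper one-class model would have won instead, which is exactly the generalising behaviour described above.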
    <Paragraph position="2"> In all the systems described above the input is not disambiguated with respect to word senses. Resnik and Ribas both report erroneous word senses as a major source of error. Ribas explains that this occurs because some individual nouns occur particularly frequently as complements to a given verb, and so all senses of these nouns also get unusually high frequencies.</Paragraph>
    <Paragraph position="3"> Li and Abe place a threshold on class frequencies before consideration of a class. In this way they hope to avoid the noise from erroneous senses. In this paper some modifications to Li and Abe's system are described and a comparison is made of the use of some word sense disambiguation.</Paragraph>
    <Section position="1" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
2.2 Word Sense Disambiguation
</SectionTitle>
      <Paragraph position="0"> Since the literature on WSD is vast there will be no attempt to describe the variety of current work here.</Paragraph>
      <Paragraph position="1"> Two approaches were investigated as possible ways for pretagging the head nouns that are used as input to the preference acquisition system. These were selected for having a low enough cost to enable tagging of a sufficient amount of text.</Paragraph>
      <Paragraph position="2"> One strategy has been suggested by Wilks and Stevenson in which the most frequent sense is picked regardless of context. In this work they distinguish senses to the homograph level given the correct part of speech and report a 95% accuracy using the most frequent sense specified by LDOCE ranking. This approach has the advantage of simplicity and training data is only needed for the estimation of one parameter, the sense frequencies.</Paragraph>
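The baseline can be sketched as follows; the helper name and frequency table are invented for illustration and are not Wilks and Stevenson's code or data.

```python
def first_sense(word, sense_freqs):
    # Pick the most frequent sense regardless of context; return None
    # for words we have no frequency information about.
    senses = sense_freqs.get(word)
    if not senses:
        return None
    return max(senses, key=senses.get)

# Hypothetical frequency table (counts invented for illustration).
freqs = {"plant": {"factory": 120, "flora": 45}}
assert first_sense("plant", freqs) == "factory"
assert first_sense("acetylene", freqs) is None
```

The single parameter referred to above is exactly the `sense_freqs` table: no contextual features are stored or consulted.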
      <Paragraph position="3"> The other approach selected was Yarowsky's unsupervised algorithm (1995). This has the advantage that it does not require any manually tagged data. His approach relies on initial seed collocations to discriminate senses that can be observed in a portion of the data. This portion is then labelled accordingly. New collocations are extracted from the labelled sample and ordered by log-likelihood as in equation 3. The new ordered collocations are then used to relabel the data, and the system iterates between observing and ordering new collocations and re-labelling the data until the stopping condition is met. The final decision list of collocations can then be used to disambiguate new occurrences of the target word.</Paragraph>
      <Paragraph position="5"/>
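A single labelling pass of this decision-list scheme might look like the sketch below. The full algorithm re-extracts and re-orders collocations from the newly labelled data and iterates; the smoothing constant, data layout, and function names here are illustrative assumptions, not Yarowsky's implementation.

```python
import math
from collections import Counter

def rank_collocations(labelled, smoothing=0.1):
    # Build a decision list: each collocate is scored by how strongly it
    # indicates one of two senses (higher score = more reliable cue).
    counts = {}
    for sense, collocates in labelled:
        for coll in collocates:
            counts.setdefault(coll, Counter())[sense] += 1
    senses = sorted({sense for sense, _ in labelled})
    decision_list = []
    for coll, by_sense in counts.items():
        a = by_sense[senses[0]] + smoothing
        b = by_sense[senses[1]] + smoothing
        best = senses[0] if a > b else senses[1]
        decision_list.append((abs(math.log(a / b)), coll, best))
    return sorted(decision_list, reverse=True)

def tag(collocates, decision_list):
    # Apply the single highest-ranked matching collocation.
    for _score, coll, sense in decision_list:
        if coll in collocates:
            return sense
    return None  # no cue matched: leave untagged

# Invented seed-labelled citations of an ambiguous word.
labelled = [("factory", ["manufacturing", "worker"]),
            ("factory", ["manufacturing"]),
            ("flora", ["growth", "leaf"])]
decisions = rank_collocations(labelled)
assert tag(["manufacturing"], decisions) == "factory"
```

Iterating `rank_collocations` over the newly tagged portion is what lets the labelled region grow until the stopping condition is met.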
    </Section>
  </Section>
  <Section position="6" start_page="53" end_page="68" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="53" end_page="68" type="sub_section">
      <SectionTitle>
3.1 Word Sense Disambiguation
</SectionTitle>
      <Paragraph position="0"> Preliminary experiments have been carried out using adaptations of the two approaches mentioned above.</Paragraph>
      <Paragraph position="1">  This experiment followed the approach of using the first sense regardless of context. Wilks and Stevenson did this in order to disambiguate LDOCE homographs. Distinguishing between WordNet senses is a much harder problem and so performance was not expected to be as good.</Paragraph>
      <Paragraph position="2"> The only frequency information available for WordNet senses, assuming large-scale manual tagging is out of the question, is the portion of the Brown corpus that has been semantically tagged with WordNet senses for the SemCor project (Miller, Leacock, Tengi, &amp; Bunker, 1993). Criteria were used alongside this frequency data specifying when to use the first sense and when to leave the ambiguity untouched. Two criteria were used initially: 1. FREQ - a threshold on the frequency of the first sense. 2. RATIO - a threshold ratio between the frequencies of the first sense and the next most frequent sense.</Paragraph>
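The two thresholds might be applied as in the following sketch; the function name, data layout, and example frequencies are hypothetical, not the paper's implementation.

```python
def tag_first_sense(word, sense_freqs, freq=5, ratio=2.0):
    # Return the first (most frequent) sense only when both thresholds
    # hold; otherwise return None, leaving the word ambiguous.
    senses = sorted(sense_freqs.get(word, []), key=lambda s: s[1], reverse=True)
    if not senses:
        return None
    first_f = senses[0][1]
    second_f = senses[1][1] if len(senses) > 1 else 0
    if first_f < freq:                           # FREQ criterion
        return None
    if second_f and first_f / second_f < ratio:  # RATIO criterion
        return None
    return senses[0][0]

# Hypothetical SemCor-style sense frequencies.
freqs = {"bank": [("institution", 20), ("river_edge", 4)],
         "plant": [("factory", 6), ("flora", 5)]}
assert tag_first_sense("bank", freqs) == "institution"   # passes both criteria
assert tag_first_sense("plant", freqs) is None           # fails RATIO
```

Returning None rather than guessing is what keeps precision high at the expense of recall in the tables that follow.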
      <Paragraph position="3"> At first FREQ was set at 5 and RATIO at 2.</Paragraph>
      <Paragraph position="4"> The method was then evaluated against the manually tagged sample of the Brown corpus (200,000 words of text) from which the frequency data was obtained, and against two small manually tagged samples from the LOB corpus (179 nouns) and the Wall Street Journal corpus (191 nouns). The results are shown in table 1. As expected, the performance was superior when scored against the same data from which the frequency estimates had been taken.</Paragraph>
      <Paragraph position="5"> Further experimentation was performed using the LOB sample and varying FREQ and RATIO. Additionally a third constraint was added (D), under which nouns identified on the SemCor project as being difficult for humans to tag were ignored.</Paragraph>
      <Paragraph position="6"> The results are shown in table 2. Although the results indicate this is a rather limited method of disambiguation, it was hoped that it would improve the</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="68" end_page="68" type="metho">
    <SectionTitle>
CRITERIA | RECALL | PRECISION
</SectionTitle>
    <Paragraph position="0"> freq 5, ratio 2     | 44 | 69
freq 3, ratio 2     | 47 | 69
freq 1, ratio 2     | 49 | 67
freq 3, ratio 1.5   | 50 | 67
freq 3, ratio 3     | 39 | 76
freq 3, ratio 2 (D) | 45 | 71
selectional preference acquisition process whilst also avoiding a heavy cost in terms of human time (for manual tagging) or computer time (for unsupervised training). For the selectional preference acquisition experiments 4 and 6 described below it was decided to use the criteria FREQ 3, RATIO 2 and D (ignore difficult nouns).</Paragraph>
    <Paragraph position="1">  Yarowsky's unsupervised algorithm (1995) was also investigated, using WordNet to generate the initial seed collocations. This has the advantage that it does not rely on a quantity of hand-tagged data; however, the time taken for training remains an issue. Without optimisation the algorithm took 15 minutes of elapsed time for 710 citations of the word &amp;quot;plant&amp;quot;. Accuracy was reasonable considering a) the quantity of data used (a corpus of 90 million words compared with Yarowsky's 460 million) and b) the simplifications made, in particular the use of only one type of collocation (footnote 3: the only collocation used was within a window of 10 words either side of the target; other simplifications include the use of a constant for smoothing, a rudimentary stopping condition, no use of the one-sense-per-discourse strategy, and no alteration of the parameters at run time). On initial experimentation it was evident that predominant senses quickly became favoured. For this reason the measure used to order the decision list</Paragraph>
    <Paragraph position="2"> was changed from log-likelihood to a log of the ratio of association scores as shown in equation 4</Paragraph>
    <Paragraph position="4"> This helped overcome the bias of conditional probabilities towards the most frequent sense. Recall is 71% and precision is 72% when using the log-likelihood to order the decision list with a stopping condition that the tagged portion exceeds 95% of the target data. The ratio of association scores compensates for the relative frequencies of the senses, and on stopping the recall is 76% and precision is 78%. Unfortunately evaluation on the target word &amp;quot;plant&amp;quot; was rather optimistic when contrasted with an evaluation on randomly selected targets involving finer word sense distinctions. In an experiment 390 mid-frequency nouns were trained and the algorithm used to disambiguate the same nouns appearing in the SemCor files of the Brown corpus. This produced only 29% for both recall and precision, which was only just better than chance. An important source of error seems to have been the poor quality of the automatically derived seeds.</Paragraph>
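The effect of the change of measure can be illustrated with toy counts. The function names and smoothing constant below are assumptions, and the sketch only approximates the paper's equations 3 and 4: the point is that normalising the collocation counts by each sense's overall frequency removes the frequency bias.

```python
import math

def loglike_score(n_s1, n_s2, smoothing=0.1):
    # Log-likelihood style score on raw sense counts for a collocation:
    # biased towards whichever sense is more frequent overall.
    return math.log((n_s1 + smoothing) / (n_s2 + smoothing))

def assoc_ratio_score(n_s1, n_s2, total_s1, total_s2, smoothing=0.1):
    # Log of the ratio of association scores: collocation counts are
    # normalised by each sense's overall frequency before comparison.
    p1 = (n_s1 + smoothing) / (total_s1 + smoothing)
    p2 = (n_s2 + smoothing) / (total_s2 + smoothing)
    return math.log(p1 / p2)

# A collocation seen 9 times with a dominant sense (900 instances overall)
# and 3 times with a rare sense (30 instances overall).
assert loglike_score(9, 3) > 0               # raw counts favour the dominant sense
assert assoc_ratio_score(9, 3, 900, 30) < 0  # normalised score favours the rare sense
```

The collocation occurs with 10% of the rare sense's instances but only 1% of the dominant sense's, so after normalisation it is correctly treated as a cue for the rare sense.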
    <Paragraph position="5"> On account of the training time that would be required, Yarowsky's unsupervised algorithm was abandoned for the purpose of tagging the argument heads. The Wilks and Stevenson style strategy was chosen instead because it requires storage of one parameter only and is exceptionally easy to apply. A major disadvantage of this approach is that lower-ranked senses do not feature in the data at all. It is hoped that this will not matter where we are collecting information from many heads in a particular slot, because any mistagging will be outweighed by correct taggings overall. However, this approach would be unhelpful where we want to distinguish behaviour for different word senses. A potential use of Yarowsky's algorithm might be verb sense distinction. The experiments outlined in the next section have been conducted using verb form rather than sense. If verb sense distinction were to be performed it would be important to obtain the preferences for the different senses; it would not be appropriate to lump the preferences together under the predominant sense. It is hoped that with some alteration to the automatic seed derivation and allowance for a coarser-grained distinction this would be viable.</Paragraph>
    <Paragraph position="6"> Li and Abe's strategy of pruning at classes with a probability below a threshold of 0.1 was adhered to, as this not only avoided noise but also reduced the search space.</Paragraph>
    <Section position="1" start_page="68" end_page="68" type="sub_section">
      <SectionTitle>
3.2 Acquisition of Selectional Prefer-
ences
</SectionTitle>
      <Paragraph position="0"> Representation and acquisition of selectional preferences is based on Li and Abe's concept of an ATCM. The details of how such a model is acquired from corpus data using WordNet and the MDL principle are given in the papers (Li &amp; Abe, 1995; Abe &amp; Li, 1996).</Paragraph>
      <Paragraph position="1"> The WordNet hypernym noun hierarchy is used here as it is available and ready made. Using a resource produced by humans has its drawbacks, particularly that the classification is not tailored to the task and data at hand and is prone to the inconsistencies and errors that beset any man-made lexical resource. Still the alternative of using an automatically clustered hierarchy has other disadvantages, a particular problem being that techniques so far developed often give rise to semantically incongruous classes (Pereira, Tishby, &amp; Lee, 1993).</Paragraph>
      <Paragraph position="2"> Calculation of the class frequencies is key to the process of acquisition of selectional preferences. Li and Abe estimate class frequencies by dividing the frequencies of nouns occurring in the set of synonyms of a class between all the classes in which they appear. Class frequencies are then inherited up the hierarchy. In order to keep to their definition of a &amp;quot;tree cut&amp;quot;, all nouns in the hierarchy need to be positioned at leaves. WordNet does not adhere to this stipulation, and so they prune the hierarchy at classes where a noun featured in the set of synonyms has occurred in the data. This strategy was abandoned in the work described here because some words in the data belonged at root classes. For example, in the direct object of &amp;quot;build&amp;quot; one instance of the word &amp;quot;entity&amp;quot; occurred, which appears at one of the roots in WordNet. If the tree were pruned at the &amp;quot;ENTITY&amp;quot; class there would be no possibility for the preference of &amp;quot;build&amp;quot; to distinguish between the subclasses &amp;quot;LIFE FORM&amp;quot; and &amp;quot;PHYSICAL OBJECT&amp;quot;.</Paragraph>
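The frequency-splitting and inheritance steps can be sketched as below. The data structures (a flat parent map, a noun-to-classes map) and the class names are illustrative assumptions, not the WordNet API.

```python
from collections import defaultdict

def class_frequencies(noun_counts, noun_classes, parent):
    # Split each noun's count evenly between the classes listing it as a
    # synonym, then propagate every class's direct share up the hierarchy.
    freq = defaultdict(float)
    for noun, count in noun_counts.items():
        candidates = noun_classes[noun]
        for cls in candidates:
            freq[cls] += count / len(candidates)
    direct = dict(freq)  # snapshot before inheritance to avoid double counting
    for cls, share in direct.items():
        node = parent.get(cls)
        while node is not None:
            freq[node] += share
            node = parent.get(node)
    return dict(freq)

# An ambiguous noun whose count is split between two classes,
# both under a common root (all names hypothetical).
counts = {"chicken": 4}
classes = {"chicken": ["VICTUALS", "LIFE_FORM"]}
parents = {"VICTUALS": "OBJECT", "LIFE_FORM": "OBJECT"}
assert class_frequencies(counts, classes, parents)["OBJECT"] == 4.0
```

Splitting counts evenly is exactly how erroneous senses acquire spurious frequency: both classes of "chicken" receive mass even when only one sense occurred.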
      <Paragraph position="3"> As an alternative strategy in this work, new leaf classes were created for every internal class in the WordNet hierarchy so that terminals only occurred at leaves but the detail of WordNet was left intact.</Paragraph>
      <Paragraph position="4"> The input data was produced by the system described in (Briscoe &amp; Carroll, 1997) and comprised 2 million words of parsed text with argument heads and subcategorisation frames identified. Only argument heads consisting of common nouns, days of the week, months, and personal pronouns (with the exception of &amp;quot;it&amp;quot;) were used. The personal pronouns were all tagged with the &amp;quot;SOMEONE&amp;quot; class, which is unambiguous in WordNet. Selectional preferences were acquired for a handful of verbs using either subject or object position. In experiment 3 class frequencies were calculated in much the same way as in Li and Abe's original experiments, dividing frequencies for each noun between the set of classes in which they featured as synonyms. In experiment 4 the nouns in the target slots were disambiguated using the approach outlined in experiment 1. Where frequency data was not available for the target word, the word was simply treated as ambiguous and class frequencies were calculated as in experiment 3.</Paragraph>
      <Paragraph position="5"> Since ATCMs have only been obtained for the subject and object slot and for 10 target verbs no formal evaluation has been performed as yet. Instead the ATCMs were examined and some observations are given below along with diagrams showing some of the models obtained. For clarity only some of the nodes have been shown and classes are labelled with some of the synonyms belonging to that class in WordNet.</Paragraph>
      <Paragraph position="6"> In order to obtain the ATCMs &amp;quot;tree cut models&amp;quot; (TCMs) for the target slot, irrespective of verb are obtained. A TCM is similar to an ATCM except that the scores associated with each class on the cut are probabilities and should sum to 1. The TCMs obtained for a given slot with and without WSD were similar.</Paragraph>
      <Paragraph position="7"> In contrast ATCMs are produced with a small data set specific to the target verb; the verbs in our target set have between 32 ('clean') and 2176 ('make') instances. Because of this the noise from erroneous senses is not as easily filtered, and WSD does seem to make a difference, although this depends on the verb and the degree of polysemy of the most common arguments.</Paragraph>
      <Paragraph position="8"> &amp;quot;Eat&amp;quot; is a verb which selects strongly for its ob- null i .... &amp;quot;l , ~ Shaded boxes ....... for new leaf created I entity 1 for internal node  ATCM no WED ............... &amp;quot; -&amp;quot; - 2:&amp;quot; '/ ATCM with WSIT ........ i ,'&amp;quot; &amp;quot;,  pictured in figure I. The ATCMs are similar but WSD gives slightly stronger scores to the appropriate nodes. Additionally the NATURAL OBJECT class changes from a slight preference in the ATCM without WSD to a score below 1 (indicating no evidence for a preference) with WSD. WSD appears to slightly improve the preferences acquired but the difference is small. The reasons are that there is a predominant sense of &amp;quot;eat&amp;quot; which selects strongly for its direct object and many of the heads in the data were monosemous (e.g. food, sandwich and pretzel).</Paragraph>
      <Paragraph position="9"> In contrast &amp;quot;establish&amp;quot; only has 79 instances and without any WSD the ATCM consists of the root node with a score of 1.8. This shows that without WSD the variety of erroneous senses causes gross over-generalisation when compared to the cut with WSD as pictured in figure 2. There are cases where the WSD is faulty and many heads are not covered by the criteria outlined in experiment 1. The head &amp;quot;right&amp;quot; for example contributes to a higher association score at the LOCATION node though its correct sense really falls under the ABSTRACTION node. However even with these inadequacies the cut with WSD appears to provide a reasonable set of preferences as opposed to the cut at the root node which is uninformative.</Paragraph>
      <Paragraph position="10"> There was no distinction of verb senses for the preferences acquired, and the data and ATCM for &amp;quot;serve&amp;quot; highlight this. &amp;quot;Serve&amp;quot; has a number of senses including &amp;quot;meet the needs of&amp;quot;, &amp;quot;set food on the table&amp;quot; and &amp;quot;undergo a due period&amp;quot;. The heads in direct object position could on the whole be identified as belonging to one or other of these senses. The ATCM with WSD is illustrated in figure 3. The GROUP, RELATION and MENTAL OBJECT nodes relate to the first sense, the SUBSTANCE node to the second, and the STATE and RELATION nodes to the third. The ATCM without WSD was again an uninformative cut at the root. Ideally preferences should be acquired respective to verb sense, otherwise the preferences for the different predicates will be confused.</Paragraph>
      <Paragraph position="11"> Although formal evaluation has as yet to be performed the models examined so far with the crude WSD seem to improve on those without. This is especially so in cases of sparse data.</Paragraph>
      <Paragraph position="12"> Some errors were due to the parser. For example, time adverbials such as &amp;quot;the night before&amp;quot; were mistaken as direct objects when the parser failed to identify the passive, as in &amp;quot;... presented a lamb, killed the night before&amp;quot;. Errors also arose because collocations such as &amp;quot;post office&amp;quot; were not recognised as such. Despite these errors the advantages of using automatic parsing are significant in terms of the quantity of data thereby made available and portability to new domains.</Paragraph>
    </Section>
    <Section position="2" start_page="68" end_page="68" type="sub_section">
      <SectionTitle>
3.3 Word Sense Disambiguation using
Selection Preferences
</SectionTitle>
      <Paragraph position="0"> The tree cuts obtained in experiments 3 and 4 have been used for WSD in a bootstrapping approach where heads, disambiguated by selectional preferences, are then used as input data to the preference acquisition system. WSD using the ATCMs simply selects all senses for a noun that fall under the node in the cut with the highest association score among those containing senses of this word. For example the sense of &amp;quot;chicken&amp;quot; under VICTUALS would be preferred over the senses under LIFE FORM when occurring as the direct object of &amp;quot;eat&amp;quot;. The granularity of the WSD depends on how specific the cut is. The approach has not been evaluated formally, although we have plans to do so with SemCor. A small evaluation has been performed comparing the manually tagged direct objects of &amp;quot;eat&amp;quot; with those selected using the cuts from experiment 3. The coarse tagging is considered correct when the identified senses contain the manually selected one. This provides a recall of 62% and precision of 93%, which can be compared to a baseline precision of 55% calculated as in equation 6: ( sum over heads n of Number_Senses_n_Under_Cut / Number_Senses_n ) / Total_Heads_Covered (6). Naturally this approach will work better for verbs which select more strongly for their arguments.</Paragraph>
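The cut-based disambiguation step can be sketched as follows; the class names, scores and the sense-to-class mapping are invented for illustration.

```python
def disambiguate(noun_senses, cut_scores, sense_to_cut_class):
    # Keep every sense of the noun that falls under the cut class with the
    # highest association score among the classes containing its senses.
    classes = {sense_to_cut_class[s] for s in noun_senses}
    best = max(classes, key=lambda c: cut_scores[c])
    return [s for s in noun_senses if sense_to_cut_class[s] == best]

# Hypothetical cut for the object slot of "eat": VICTUALS is strongly
# preferred, so the food sense of "chicken" survives.
cut = {"VICTUALS": 3.2, "LIFE_FORM": 0.4}
mapping = {"chicken/food": "VICTUALS", "chicken/bird": "LIFE_FORM"}
assert disambiguate(["chicken/food", "chicken/bird"], cut, mapping) == ["chicken/food"]
```

Because every sense under the winning class is kept, the output is a coarse sense set rather than a single sense, matching the granularity caveat above.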
      <Paragraph position="1"> Further experiments have been conducted which feed the disambiguated heads back into the selectional preference acquisition system.</Paragraph>
      <Paragraph position="2">  In experiment 5, cuts obtained in experiment 3, without any initial WSD, are used to disambiguate the heads before these are fed back in. In contrast, experiment 6 uses the cuts obtained with Wilks and Stevenson style WSD from experiment 4 to disambiguate the heads. In both cases the cuts are only used to disambiguate the heads appearing with the target verb, and the full data sample required for the prior distribution TCM is left as in experiments 3 and 4.</Paragraph>
      <Paragraph position="3"> Where the verb selects strongly for its arguments, for example &amp;quot;eat&amp;quot;, the cuts obtained in experiments 5 and 6 were similar to those achieved with initial Wilks and Stevenson WSD; for example, both have the effect of taking the class NATURAL OBJECT below 1 (i.e. removing the weak indication of a preference). In contrast, where the quantity of data is sparse and the verb selects less strongly, the cut obtained from fully ambiguous data (experiment 5) is unhelpful for WSD. However, if the Wilks and Stevenson style disambiguation is performed on the initial data, the cuts in experiment 6 show great improvement on those from experiment 4. For example the ATCM in experiment 6 for &amp;quot;establish&amp;quot; showed no preferences for the LOCATION and POSSESSION nodes, where preferences in experiment 4 had arisen because of erroneous word senses.</Paragraph>
    </Section>
  </Section>
</Paper>