<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2034"> <Title>Using Semantic Preferences to Identify Verbal Participation in Role Switching Alternations.</Title> <Section position="4" start_page="256" end_page="257" type="metho"> <SectionTitle> 3 Method </SectionTitle> <Paragraph position="0"> We use both syntactic and semantic information to identify participants in RSAs. Firstly, syntactic processing is used to find candidates taking the alternating SCFs. Secondly, selectional preference models are acquired for the argument heads associated with a specific slot in a specific SCF of a verb.</Paragraph> <Paragraph position="1"> We use the SCF acquisition system of Briscoe and Carroll (1997), with a probabilistic LR parser (Inui et al., 1997) for syntactic processing. The corpus data is POS tagged and lemmatised before the LR parser is applied. Subcategorization patterns are extracted from the parses; these include both the syntactic categories and the argument heads of the constituents.</Paragraph> <Paragraph position="2"> These subcategorization patterns are then classified according to a set of 161 SCF classes. The SCF entries for each verb are then subjected to a statistical filter which removes SCFs that have occurred with a frequency less than would be expected by chance.</Paragraph> <Paragraph position="3"> The resulting SCF lexicon lists each verb with the SCFs it takes. Each SCF entry includes a frequency count and lists the argument heads at all slots.</Paragraph> <Paragraph position="4"> Selectional preferences are automatically acquired for the slots involved in the role switching. We refer to these as the target slots. For the causative alternation, the slots are the direct object slot of the transitive SCF and the subject slot of the intransitive. 
For the conative, the slots are the direct object of the transitive and the PP of the np v pp SCF.</Paragraph> <Paragraph position="5"> Selectional preferences are acquired using the method devised by Li and Abe (1995). The preferences for a slot are represented as a tree cut model (TCM). This is a set of disjoint classes that partition the leaves of the WordNet noun hypernym hierarchy. A conditional probability is attached to each of the classes in the set. To ensure the TCM covers all the word senses in WordNet, we modify Li and Abe's original scheme by creating hyponym leaf classes below all of WordNet's hypernym (internal) classes. Each leaf holds the word senses previously held at the internal class. The nominal argument heads from a target slot are collected and used to populate the WordNet hierarchy with frequency information. The head lemmas are matched to the classes which contain them as synonyms. Where a lemma appears as a synonym in more than one class, its frequency count is divided between all classes for which it has direct membership. The frequency counts from hyponym classes are added to the count for each hypernym class. A root node, created above all the WordNet roots, contains the total frequency count for all the argument head lemmas found within WordNet. The minimum description length principle (MDL) (Rissanen, 1978) is used to find the best TCM by considering the cost (in bits) of describing both the model and the argument head data encoded in the model.</Paragraph> <Paragraph position="6"> The cost (or description length) for a TCM is calculated according to equation 1. The number of parameters of the model is given by k, which is the number of classes in the TCM minus one. S is the sample of argument head data, and |S| is its size. 
The cost of describing each argument head (n) is calculated using the log of the probability estimate for the class on the TCM that n belongs to (c_n).</Paragraph> <Paragraph position="7"> $\text{description length} = \frac{k}{2} \times \log |S| - \sum_{n \in S} \log p(c_n)$ (1) A small portion of the TCM for the object slot of start in the transitive frame is displayed in figure 1. WordNet classes are displayed in boxes with a label which best reflects the sense of the class. The probability estimates are shown for the classes along the TCM. Examples of the argument head data are displayed below the WordNet classes with dotted lines indicating membership at a hyponym class beneath these classes.</Paragraph> <Paragraph position="8"> We assume that verbs which participate will show a higher degree of similarity between the preferences at the target slots compared with non-participating verbs. To compare the preferences we compare the probability distributions across WordNet using a measure of distributional similarity. Since the probability distributions may be at different levels of WordNet, we map the TCMs at the target slots to a common tree cut, a &quot;base cut&quot;. We experiment with two different types of base cut. The first is simply a base cut at the eleven root classes of WordNet. We refer to this as the &quot;root base cut&quot; (RBC). The second is termed the &quot;union base cut&quot; (UBC). This is obtained by taking all classes from the union of the two TCMs which are not subsumed by another class in this union. Duplicates are removed. Probabilities are assigned to the classes of a base cut using the estimates on the original TCM. The probability estimate for a hypernym class is obtained by combining the probability estimates for all its hyponyms on the original cut. 
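
The construction of a union base cut can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the toy hierarchy, the class names (A, B, C and so on) and the helper functions are hypothetical, and each TCM is assumed to be a plain mapping from class name to probability estimate.

```python
def ancestors(cls, parent):
    """Yield the proper ancestors of cls, nearest first."""
    while cls in parent:
        cls = parent[cls]
        yield cls


def union_base_cut(tcm1, tcm2, parent):
    """Classes in the union of two cuts that no other class in the union subsumes."""
    union = set(tcm1) | set(tcm2)
    return {c for c in union if not any(a in union for a in ancestors(c, parent))}


def project(tcm, base_cut, parent):
    """Add each TCM class's probability to the base-cut class that covers it."""
    probs = {c: 0.0 for c in base_cut}
    for cls, p in tcm.items():
        for node in (cls, *ancestors(cls, parent)):
            if node in probs:
                probs[node] += p
                break
    return probs


# Hypothetical hierarchy: ROOT over A and D; A over B and C; D over E and F.
parent = {"A": "ROOT", "D": "ROOT", "B": "A", "C": "A", "E": "D", "F": "D"}
tcm1 = {"A": 0.5, "E": 0.25, "F": 0.25}     # cut at A, E, F
tcm2 = {"B": 0.125, "C": 0.125, "D": 0.75}  # cut at B, C, D

ubc = union_base_cut(tcm1, tcm2, parent)
print(sorted(ubc))
print(project(tcm1, ubc, parent))
print(project(tcm2, ubc, parent))
```

In this toy case the union base cut keeps only A and D, and each TCM's probability mass is summed upwards into those two classes so that the two distributions can be compared class by class.
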
Figure 2 exemplifies this process for two TCMs (TCM1 and TCM2) in an imaginary hierarchy.</Paragraph> <Paragraph position="9"> The UBC is at the classes B, C and D.</Paragraph> <Paragraph position="10"> To quantify the similarity between the probability distributions for the target slots we use the α-skew divergence (αSD) proposed by Lee (1999) (see footnote 1). This measure, defined in equation 2, is a smoothed version of the Kullback-Leibler divergence; p1(x) and p2(x) are the two probability distributions which are being compared. The α constant is a value between 0 and 1 which smooths p1(x) with p2(x) so that αSD is always defined. We use the same value of α (0.99) as Lee. If α is set to 1 then this measure is equivalent to the Kullback-Leibler divergence.</Paragraph> <Paragraph position="11"> Footnote 1: We also experimented with Euclidean distance, the L1 norm, and cosine measures. The differences in performance of these measures were not statistically significant.</Paragraph> <Paragraph position="13"/> </Section> <Section position="5" start_page="257" end_page="260" type="metho"> <SectionTitle> 4 Experimental Evaluation </SectionTitle> <Paragraph position="0"> We experiment with an SCF lexicon produced from 19.3 million words of parsed text from the BNC (Leech, 1992). We used the causative and conative alternations, since these have enough candidates in our lexicon for experimentation. Evaluation is performed on verbs already filtered by the syntactic processing. The SCF acquisition system has been evaluated elsewhere (Briscoe and Carroll, 1997).</Paragraph> <Paragraph position="1"> We selected candidate verbs which occurred with 10 or more nominal argument heads at the target slots. The argument heads were restricted to those which can be classified in the WordNet hypernym hierarchy. Candidates were selected by hand so as to obtain an even split between candidates which did participate in the alternation (positive candidates) and those which did not (negative candidates). 
Four human judges were used to determine the &quot;gold standard&quot;. The judges were asked to specify a yes or no decision on participation for each verb. They were also permitted a &quot;don't know&quot; verdict. The kappa statistic (Siegel and Castellan, 1988) was calculated to ensure that there was significant agreement between judges for the initial set of candidates. From these, verbs were selected which had 75% or more agreement, i.e. three or more judges giving the same yes or no decision for the verb.</Paragraph> <Paragraph position="2"> For the causative alternation we were left with 46 positives and 53 negatives. For the conative alternation we had 6 of each. In both cases, we used the Mann Whitney U test to see if there was a significant relationship between the similarity measure and participation. We then used a threshold on the similarity scores as the decision point for participation to determine a level of accuracy. We experimented with both the mean and median of the scores as a threshold. Seven of the negative causative candidates were randomly chosen and removed to ensure an even split between positive and negative candidates for determining accuracy using the mean and median as thresholds.</Paragraph> <Paragraph position="3"> The following subsection describes the results of the experiments using the method described in section 3 above. Subsection 4.2 describes an experiment on the same data to determine participation using a similarity measure based on the intersection of the lemmas at the target slots.</Paragraph> <Section position="1" start_page="258" end_page="259" type="sub_section"> <SectionTitle> 4.1 Using Syntax and Selectional Preferences </SectionTitle> <Paragraph position="0"> The results for the causative alternation are displayed in table 1 for both the RBC and the UBC. The relationship between participation and αSD is highly significant in both cases, with values of p well below 0.01. 
Accuracies for the mean and median thresholds are displayed in the fourth and fifth columns. Both thresholds outperform the random baseline of 50%.</Paragraph> <Paragraph position="1"> The results for the UBC are slightly better than those for the RBC; however, the improvement is not significant.</Paragraph> <Paragraph position="2"> The numbers of false negative (FN) and false positive (FP) errors for the mean and median thresholds are displayed in table 2, along with the threshold and accuracy. The outcomes for each individual verb for the experiment using the RBC and the mean threshold are as follows: * True negatives: add admit answer believe borrow cost declare demand expect feel imagine know notice pay perform practise proclaim read remember sing survive understand win write * True positives: accelerate bang bend boil break burn change close cook cool crack decrease drop dry end expand fly improve increase match melt open ring rip rock roll shatter shut slam smash snap spill split spread start stop stretch swing tilt turn wake * False negatives: flood land march repeat terminate * False positives: ask attack catch choose climb drink eat help kick knit miss outline pack paint plan prescribe pull remain steal suck warn wash The results for the UBC experiment are very similar. If the median is used, the numbers of FPs and FNs are evenly balanced. This is because the median threshold is, by definition, taken midway between the test items arranged in order of their similarity scores. There are equal numbers of items on either side of the decision point, and equal numbers of positive and negative candidates in our test sample. Thus, the errors on either side of the decision point are equal in number.</Paragraph> <Paragraph position="3"> For both base cuts, there are more false positives than false negatives when the mean is used. The mean produces a higher accuracy than the median, but gives an increase in false positives. 
Many false positives arise where the preferences at both target slots are near neighbours in WordNet. For example, this occurred for eat and drink. These verbs have a high probability mass (around 0.7) under the entity class in both target slots, since both people and types of food occur under this class. In cases like these, the probability distributions at the RBC, and frequently the UBC, are not sufficiently distinctive. The polysemy of the verbs may provide another explanation for the large quantity of false positives. The SCFs and data of different senses should not ideally be combined, at least not for coarse grained sense distinctions. We tested the false positive and true negative candidates to see if there was a relationship between the polysemy of a verb and its misclassification. The number of senses (according to WordNet) was used to indicate the polysemy of a verb. The Mann Whitney U test was performed on the verbs found to be true negative and false positive using the RBC. A significant relationship was not found between polysemy and misclassification. Both groups had an average of 5 senses per verb.</Paragraph> <Paragraph position="4"> This is not to say that distinguishing verb senses would not improve performance, provided that there was sufficient data. However, verb polysemy does not appear to be a major source of error, from our preliminary analysis. In many cases, such as read, which was classified by both the judges and the system as a negative candidate, the predominant sense of the verb provides the majority of the data. Alternate senses, for example, the book reads well, often do not contribute enough data to give rise to a large proportion of errors. Finding an appropriate inventory of senses would be difficult, since we would not wish to separate related senses which occur as alternate variants of one another. 
The inventory would therefore require knowledge of the phenomena that we are endeavouring to acquire automatically. To show that our method will work for other RSAs, we use the conative. Our sample size is rather small since we are limited by the number of positive candidates in the corpus having sufficient frequency for both sets. The sparse data problem is acute when we look at alternations with specific prepositions. A sample of 12 verbs (6 positive and 6 negative) remained after the selection process outlined above. For this small sample we obtained a significant result (p = 0.02) with a mean accuracy of 67% and a median accuracy of 83%. On this occasion, the median performed better than the mean. More data is required to see if this difference is significant.</Paragraph> </Section> <Section position="2" start_page="259" end_page="260" type="sub_section"> <SectionTitle> 4.2 Using Syntax and Lemmas </SectionTitle> <Paragraph position="0"> This experiment was conducted using the same data as that used in the previous subsection. In this experiment, we used a similarity score on the argument heads directly, instead of generalizing the argument heads to WordNet classes. The Venn diagram in figure 3 shows a subset of the lemmas at the transitive and intransitive SCFs for the verb break.</Paragraph> <Paragraph position="1"> The lemma based similarity measure is termed lemma overlap (LO) and is given in equation 3, where A and B represent the target slots. LO is the size of the intersection of the multisets of argument heads at the target slots, divided by the size of the smaller of the two multisets. The intersection of two multisets includes duplicate items only as many times as the item is in both sets. For example, if one slot contained the argument heads {person, person, person, child, man, spokeswoman}, and the other slot contained {person, person, child, chair, collection}, then the intersection would be {person, person, child}, and LO would be 3/5. 
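
The LO computation can be sketched in Python as follows. The function name lemma_overlap and the use of collections.Counter to model multisets are illustrative assumptions rather than part of the original system; the input is the example given above.

```python
from collections import Counter


def lemma_overlap(slot_a, slot_b):
    """Size of the multiset intersection divided by the size of the smaller multiset."""
    a, b = Counter(slot_a), Counter(slot_b)
    # A lemma is shared as many times as it occurs in both multisets.
    shared = sum(min(count, b[lemma]) for lemma, count in a.items())
    smaller = min(sum(a.values()), sum(b.values()))
    # Returning 0.0 for an empty slot is an assumption made for this sketch.
    return shared / smaller if smaller else 0.0


slot_a = ["person", "person", "person", "child", "man", "spokeswoman"]
slot_b = ["person", "person", "child", "chair", "collection"]
print(lemma_overlap(slot_a, slot_b))  # 3 shared heads over a smaller multiset of 5
```
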
This measure ranges between zero (no overlap) and 1 (where the multiset at one slot is a subset of that at the other slot).</Paragraph> <Paragraph position="3"> Using the Mann Whitney U test on the LO scores, we obtained a z score of 2.00. This is significant at the 95% level, a lower level than that for the class-based experiments. The results using the mean and median of the LO scores are shown in table 3. Performance is lower than that for the class-based experiments. The outcome for the individual verbs using the mean as a threshold was: * True negatives: add admit answer borrow choose climb cost declare demand drink eat feel imagine notice outline pack paint perform plan practise prescribe proclaim read remain sing steal suck survive understand wash win write * True positives: bend boil burn change close cool dry end fly improve increase match melt open ring roll shut slam smash start stop tilt wake * False negatives: accelerate bang break cook crack decrease drop expand flood land march repeat rip rock shatter snap spill split spread stretch swing terminate turn * False positives: ask attack believe catch expect help kick knit know miss pay pull remember warn Interestingly, the errors for the LO measure tend to be false negatives, rather than false positives. The LO measure is much more conservative than the approach using the TCMs. In this case the median threshold produces better results.</Paragraph> <Paragraph position="4"> For the conative alternation, the lemma based method does not show a significant relationship between participation and the LO scores. Moreover, there is no difference between the sums of the ranks of the two groups for the Mann Whitney U test. 
The mean produces an accuracy of 58% whilst the median produces an accuracy of 50%.</Paragraph> </Section> </Section> <Section position="6" start_page="260" end_page="261" type="metho"> <SectionTitle> 5 Related Work </SectionTitle> <Paragraph position="0"> There has been some recent interest in observing alternations in corpora (McCarthy and Korhonen, 1998; Lapata, 1999) and predicting related verb classifications (Stevenson and Merlo, 1999). Earlier work by Resnik (1993) demonstrated a link between selectional preference strength and participation in alternations where the direct object is omitted. Resnik used syntactic information from the bracketing within the Penn Treebank corpus. Research into the identification of other diathesis alternations has been advanced by the availability of automatic syntactic processing. Most work using corpus evidence for verb classification has relied on a priori knowledge in the form of linguistic cues specific to the phenomena being observed (Lapata, 1999; Stevenson and Merlo, 1999). Our approach, whilst being applicable only to RSAs, does not require human input specific to the alternation at hand.</Paragraph> <Paragraph position="1"> Lapata (1999) identifies participation in the dative and benefactive alternations. Lapata's strategy is to identify participants using a shallow parser and various linguistic and semantic cues, which are specified manually for these two alternations. PP attachments are resolved using Hindle and Rooth's (1993) lexical association score. Compound nouns, which could be mistaken for the double object construction, were filtered using the log-likelihood ratio test. The semantic cues were obtained by manual analysis. The relative frequency of a SCF for a verb, compared to the total frequency of the verb, was used for filtering out erroneous SCFs.</Paragraph> <Paragraph position="2"> Lapata does not report recall and precision figures against a gold standard. 
The emphasis is on the phenomena actually evident in the corpus data.</Paragraph> <Paragraph position="3"> Many of the verbs listed in Levin as taking the alternation were not observed with this alternation in the corpus data. This amounted to 44% of the verbs for the benefactive, and 52% for the dative.</Paragraph> <Paragraph position="4"> These figures only take into account the verbs for which at least one of the SCFs was observed. 54% of the verbs listed for the dative and benefactive by Levin were not acquired with either of the target SCFs. Conversely, many verbs not listed in Levin were identified as taking the benefactive or dative alternation using Lapata's criteria. Manual analysis of these verbs revealed 18 false positives out of 52 candidates.</Paragraph> <Paragraph position="5"> Stevenson and Merlo (1999) use syntactic and lexical cues for classifying 60 verbs in three verb classes: unergative, unaccusative and verbs with an optional direct object. These three classes were chosen because a few well defined features, specified a priori, can distinguish the three groups. Twenty verbs from Levin's classification were used in each class.</Paragraph> <Paragraph position="6"> They were selected by virtue of having sufficient frequency in a combined corpus (from the Brown and the WSJ) of 65 million words. The verbs were also chosen for having one predominant intended sense in the corpus. Stevenson and Merlo used four linguistically motivated features to distinguish these groups. Counts from the corpus data for each of the four features were normalised to give a score on a scale of 1 to 100. One feature was the causative non-causative distinction. For this feature, a measure similar to our LO measure was used. The four features were identified in the corpus using automatic POS tagging and parsing of the data. The data for half of the verbs in each class was subject to manual scrutiny, after initial automatic processing. 
The rest of the data was produced fully automatically. The verbs were classified automatically using the four features.</Paragraph> <Paragraph position="7"> The accuracy of automatic classification was 52% using all four features, compared to a baseline of 33%. The best result was obtained using a combination of three features. This gave an accuracy of 66%.</Paragraph> <Paragraph position="8"> McCarthy and Korhonen (1998) proposed a method for identifying RSAs using MDL. This method relied on an estimation of the cost of using TCMs to encode the argument head data at a target slot. The sum of the costs for the two target slots was compared to the cost of a TCM for encoding the union of the argument head data over the two slots. Results are reported for the causative alternation with 15 verbs. This method depends on there being similar quantities of data at the alternating slots, otherwise the data at the more frequent slot overwhelms the data at the less frequent slot. However, many alternations involve SCFs with substantially different relative frequencies, especially when one SCF is specific to a particular preposition. We carried out some experiments using the MDL method and our TCMs. For the causative, we used a sample of 110 verbs and obtained 63% accuracy. For the conative, a sample of 16 verbs was used and this time accuracy was only 56%. Notably, only one negative decision was made because of the disparate frame frequencies, which reduces the cost of combining the argument head data.</Paragraph> </Section> </Paper>