<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1046">
  <Title>Evaluating Smoothing Algorithms against Plausibility Judgements</Title>
  <Section position="3" start_page="0" end_page="3" type="metho">
    <SectionTitle>
2 Smoothing Methods
</SectionTitle>
    <Paragraph position="0"> Smoothing techniques have been used in a variety of statistical natural language processing applications as a means to address data sparseness, an inherent problem for statistical methods which rely on the relative frequencies of word combinations.</Paragraph>
    <Paragraph position="1"> The problem arises when the probability of word combinations that do not occur in the training data needs to be estimated. The smoothing methods proposed in the literature (overviews are provided by Dagan et al. (1999) and Lee (1999)) can be generally divided into three types: discounting (Katz, 1987), class-based smoothing (Resnik, 1993; Brown et al., 1992; Pereira et al., 1993), and distance-weighted averaging (Grishman and Sterling, 1994; Dagan et al., 1999).</Paragraph>
    <Paragraph position="2"> Discounting methods decrease the probability of previously seen events so that the total probability of observed word co-occurrences is less than one, leaving some probability mass to be redistributed among unseen co-occurrences.</Paragraph>
    <Paragraph position="3"> Class-based smoothing and distance-weighted averaging both rely on an intuitively simple idea: inter-word dependencies are modelled by relying on the corpus evidence available for words that are similar to the words of interest. The two approaches differ in the way they measure word similarity. Distance-weighted averaging estimates word similarity from lexical co-occurrence information, viz., it finds similar words by taking into account the linguistic contexts in which they occur: two words are similar if they occur in similar contexts. In class-based smoothing, classes are used as the basis according to which the co-occurrence probability of unseen word combinations is estimated. Classes can be induced directly from the corpus (Pereira et al., 1993; Brown et al., 1992) or taken from a manually crafted taxonomy (Resnik, 1993). In the latter case the taxonomy is used to provide a mapping from words to conceptual classes.</Paragraph>
    <Paragraph position="4"> In language modelling, smoothing techniques are typically evaluated by showing that a language model which uses smoothed estimates achieves lower perplexity on test data than a model that does not employ smoothed estimates (Katz, 1987). Dagan et al. (1999) use perplexity to compare back-off smoothing against distance-weighted averaging methods and show that the latter outperform the former. They also compare different distance-weighted averaging methods on a pseudo-word disambiguation task where the language model decides which of two verbs $v_1$ and $v_2$ is more likely to take a noun $n$ as its object, i.e., which of the unseen pairs $(v_1, n)$ and $(v_2, n)$ is a valid verb-object combination.</Paragraph>
    <Paragraph position="5"> In our experiments we recreated co-occurrence frequencies for unseen adjective-noun pairs using two different approaches: taxonomic class-based smoothing and distance-weighted averaging.</Paragraph>
    <Paragraph position="6">  We evaluated the recreated frequencies by comparing them with plausibility judgements elicited from human subjects. In contrast to previous work, this type of evaluation does not presuppose that the recreated frequencies are needed for a specific natural language processing task. Rather, our aim is to establish an independent criterion for the validity of smoothing techniques by comparing them to plausibility judgements, which are known to correlate with co-occurrence frequency (Lapata et al., 1999).</Paragraph>
    <Paragraph position="7"> Discounting methods were not included, as Dagan et al. (1999) demonstrated that distance-weighted averaging achieves better language modelling performance than back-off.</Paragraph>
    <Paragraph position="8"> In the remainder of this paper we present class-based smoothing and distance-weighted averaging as applied to unseen adjective-noun combinations (see Sections 2.1 and 2.2). Section 3 details our judgement elicitation experiment and reports our results.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Class-based Smoothing
</SectionTitle>
      <Paragraph position="0"> We recreated co-occurrence frequencies for unseen adjective-noun pairs using a simplified version of Resnik's (1993) selectional association measure. Selectional association is defined as the amount of information a given predicate carries about its argument, where the argument is represented by its corresponding classes in a taxonomy such as WordNet (Miller et al., 1990). This means that predicates which impose few restrictions on their arguments have low selectional association values, whereas predicates selecting for a restricted number of arguments have high selectional association values. Consider the verbs see and polymerise: intuitively there is a great variety of things which can be seen, whereas there is a very specific set of things which can be polymerised (e.g., ethylene). Resnik demonstrated that his measure of selectional association successfully captures this intuition: selectional association values are correlated with verb-argument plausibility as judged by native speakers. However, Lapata et al. (1999) found that the success of selectional association as a predictor of plausibility does not seem to carry over to adjective-noun plausibility. There are two potential reasons for this: (1) the semantic restrictions that adjectives impose on the nouns with which they combine appear to be less strict than the ones imposed by verbs (consider the adjective superb which can combine with nearly any noun); and (2) given their lexicalist nature, adjective-noun combinations may defy selectional restrictions yet be intuitively plausible (consider the pair sad day, where sadness is not an attribute of day).</Paragraph>
      <Paragraph position="1"> To address these problems, we replaced Resnik's information-theoretic measure with a simpler measure which makes no assumptions with respect to the contribution of a semantic class to the total quantity of information provided by the predicate about the semantic classes of its argument. We simply substitute the noun occurring in the adjective-noun combination with the concept by which it is represented in the taxonomy and estimate the adjective-noun co-occurrence frequency by counting the number of times the concept corresponding to the noun is observed to co-occur with the adjective in the corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
ing WordNet
</SectionTitle>
      <Paragraph position="0"> Because a given word is not always represented by a single class in the taxonomy (i.e., the noun co-occurring with an adjective can generally be the realisation of one of several conceptual classes), we constructed the frequency counts for an adjective-noun pair for each conceptual class by dividing the contribution of each noun observed with the adjective by the number of classes to which the noun belongs (Lauer, 1995; Resnik, 1993):
$$f(a, c) \approx \sum_{n' \in c} \frac{f(a, n')}{|\mathrm{classes}(n')|} \qquad (1)$$
where $f(a, n')$ is the number of times the adjective $a$ was observed in the corpus with the noun $n'$, and $|\mathrm{classes}(n')|$ is the number of conceptual classes the noun $n'$ belongs to. Note that the estimation of the frequency $f(a, c)$ relies on the simplifying assumption that the noun co-occurring with the adjective is distributed evenly across its conceptual classes. This simplification is necessary unless we have a corpus of adjective-noun pairs labelled explicitly with taxonomic information. Consider the pair proud chief, which is not attested in the British National Corpus (BNC) (Burnard, 1995). The word chief has two senses in WordNet and belongs to seven conceptual classes (⟨causal agent⟩, ⟨entity⟩, ⟨leader⟩, ⟨life form⟩, ⟨person⟩, ⟨superior⟩, and ⟨supervisor⟩). This means that the co-occurrence frequency of the adjective-noun pair will be constructed for each of the seven classes, as shown in Table 1. Suppose for example that we see the pair proud leader in the corpus. The word leader has two senses in WordNet and belongs to eight conceptual classes (⟨person⟩, ⟨life form⟩, ⟨entity⟩, ⟨causal agent⟩, ⟨feature⟩, ⟨merchandise⟩, ⟨commodity⟩, and ⟨object⟩). The words chief and leader have four conceptual classes in common, i.e., ⟨person⟩, ⟨life form⟩, ⟨entity⟩, and ⟨causal agent⟩.</Paragraph>
      <Paragraph position="1"> This means that we will increment the observed co-occurrence count of proud and ⟨person⟩, proud and ⟨life form⟩, proud and ⟨entity⟩, and proud and ⟨causal agent⟩ by 1/8, since leader belongs to eight classes. (Footnote: There are several ways of addressing this problem, e.g., by discounting the contribution of very general classes by finding a suitable class to represent a given concept (Clark and Weir, 2001).)</Paragraph>
      <Paragraph position="2"> Since we do not know the actual class of the noun chief in the corpus, we weight the contribution of each class by taking the average of the constructed frequencies for all seven classes:
$$f(a, n) = \frac{\sum_{c \in \mathrm{classes}(n)} f(a, c)}{|\mathrm{classes}(n)|} \qquad (2)$$
Based on (2) the recreated frequency for the pair proud chief in the BNC is 6.12 (see Table 1).</Paragraph>
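The two-step estimate described above (spreading each observed adjective-noun count evenly over the noun's classes, then averaging the class-level counts for the unseen noun) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the class memberships are the ones listed in the text, and the single observed count for proud leader is invented toy data.

```python
from collections import defaultdict


def class_counts(pair_counts, classes):
    """Spread each observed (adjective, noun) count evenly over the noun's classes."""
    f_ac = defaultdict(float)
    for (adj, noun), count in pair_counts.items():
        noun_classes = classes.get(noun, ())
        for c in noun_classes:
            f_ac[(adj, c)] += count / len(noun_classes)
    return f_ac


def recreated_frequency(adj, noun, f_ac, classes):
    """Average the class-level counts over all classes of the (unseen) noun."""
    noun_classes = classes.get(noun, ())
    if not noun_classes:
        return 0.0
    return sum(f_ac.get((adj, c), 0.0) for c in noun_classes) / len(noun_classes)


# Class memberships as listed in the text; the observed count is hypothetical.
classes = {
    "chief": ["causal agent", "entity", "leader", "life form",
              "person", "superior", "supervisor"],
    "leader": ["person", "life form", "entity", "causal agent",
               "feature", "merchandise", "commodity", "object"],
}
pair_counts = {("proud", "leader"): 1}

f_ac = class_counts(pair_counts, classes)
estimate = recreated_frequency("proud", "chief", f_ac, classes)
print(estimate)
```

With real corpus counts, `pair_counts` would hold every attested adjective-noun pair, so general classes such as ⟨entity⟩ would accumulate contributions from many nouns.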
    </Section>
    <Section position="3" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
2.2 Distance-Weighted Averaging
</SectionTitle>
      <Paragraph position="0"> Distance-weighted averaging induces classes of similar words from word co-occurrences without making reference to a taxonomy. A key feature of this type of smoothing is the function which measures distributional similarity from co-occurrence frequencies. Several measures of distributional similarity have been proposed in the literature (Dagan et al., 1999; Lee, 1999). We used two measures, the Jensen-Shannon divergence and the confusion probability. Those two measures have been previously shown to give promising performance for the task of estimating the frequencies of unseen verb-argument pairs (Dagan et al., 1999; Grishman and Sterling, 1994; Lapata, 2000; Lee, 1999). In the following we describe these two similarity measures and show how they can be used to recreate the frequencies for unseen adjective-noun pairs.</Paragraph>
      <Paragraph position="1"> Jensen-Shannon Divergence. The Jensen-Shannon divergence is an information-theoretic measure that recasts the concept of distributional similarity into a measure of the "distance" (i.e., dissimilarity) between two probability distributions.</Paragraph>
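      <Paragraph position="2"> Let $w_1$ and $w_1'$ be two words whose distributional similarity is to be determined. Let $P(w_2 \mid w_1)$ denote the conditional probability of word $w_2$ given $w_1$, and $P(w_2 \mid w_1')$ the conditional probability of $w_2$ given $w_1'$. The Jensen-Shannon divergence is the average Kullback-Leibler divergence of each of two distributions $p$ and $q$ to their average distribution:
$$J(p, q) = \frac{1}{2}\left[ D\left(p \,\Big\|\, \frac{p+q}{2}\right) + D\left(q \,\Big\|\, \frac{p+q}{2}\right) \right]$$
The Kullback-Leibler divergence is an information-theoretic measure of the dissimilarity of two probability distributions p and q, defined as follows:
$$D(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$$</Paragraph>
      <Paragraph position="4"> In our case the distributions p and q are the conditional probability distributions $P(w_2 \mid w_1)$ and $P(w_2 \mid w_1')$, respectively. Computation of the Jensen-Shannon divergence depends only on the linguistic contexts $w_2$ which the two words $w_1$ and $w_1'$ have in common. The Jensen-Shannon divergence, a dissimilarity measure, is transformed to a similarity measure as follows:
$$W_J(p, q) = 10^{-\beta J(p, q)}$$
The parameter $\beta$ controls the relative influence of the words most similar to $w_1$: if $\beta$ is high, only words extremely similar to $w_1$ contribute to the estimate, whereas if $\beta$ is low, less similar words also contribute to the estimate.</Paragraph>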
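As an illustration of the measure just described, the sketch below computes the Kullback-Leibler divergence, the Jensen-Shannon divergence, and a similarity weight of the form 10 to the power of minus beta times J. It is only a sketch under stated assumptions: distributions are plain dictionaries mapping context words to probabilities, natural logarithms are used (the base is not recoverable from the text), and the two toy distributions are invented.

```python
import math


def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / q[w]) for w, pi in p.items() if pi > 0)


def js_divergence(p, q):
    """Average KL divergence of p and q to their mean distribution."""
    support = set(p) | set(q)
    mean = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in support}
    return 0.5 * (kl_divergence(p, mean) + kl_divergence(q, mean))


def js_similarity(p, q, beta=0.5):
    """Turn the dissimilarity J into a similarity weight 10 ** (-beta * J)."""
    return 10 ** (-beta * js_divergence(p, q))


# Hypothetical conditional distributions P(noun | adjective) for two adjectives.
p_proud = {"father": 0.5, "owner": 0.3, "nation": 0.2}
p_happy = {"father": 0.4, "owner": 0.1, "ending": 0.5}

print(js_divergence(p_proud, p_happy))
print(js_similarity(p_proud, p_happy, beta=0.5))
```

Identical distributions give a divergence of zero and hence a similarity of one; with natural logarithms, distributions with no shared contexts reach the maximum divergence of log 2.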
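      <Paragraph position="5"> Confusion Probability. The confusion probability is an estimate of the probability that word $w_1'$ can be substituted by word $w_1$, in the sense of being found in the same linguistic contexts $w_2$:
$$P_C(w_1 \mid w_1') = \sum_{w_2} P(w_1 \mid w_2)\, P(w_2 \mid w_1')$$
These conditional probabilities can be easily estimated from their relative frequency in the corpus as follows:
$$P(w_1 \mid w_2) = \frac{f(w_1, w_2)}{f(w_2)}, \qquad P(w_2 \mid w_1') = \frac{f(w_1', w_2)}{f(w_1')}$$
where $f(w_1, w_2)$ is the number of times $w_1$ and $w_2$ co-occur in the corpus.</Paragraph>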
      <Paragraph position="7"> The performance of distance-weighted averaging depends on two parameters: (1) the number of items over which the similarity function is computed (i.e., the size of the set $S(w_1)$ of words most similar to $w_1$), and (2) the value of the parameter $\beta$ (which is only relevant for the Jensen-Shannon divergence). In this study we recreated adjective-noun frequencies using the 1,000 and 2,000 most frequent items (nouns and adjectives), for both the confusion probability and the Jensen-Shannon divergence. Furthermore, we set $\beta$ to .5, which experiments showed to be the best value for this parameter.</Paragraph>
      <Paragraph position="8"> Once we know which words are most similar to either the adjective or the noun (irrespective of the function used to measure similarity) we can exploit this information in order to recreate the co-occurrence frequency for unseen adjective-noun pairs. We use the weighted average of the evidence provided by the similar words, where the weight given to a word $w_1'$ depends on its similarity to $w_1$:
$$P_{SIM}(w_2 \mid w_1) = \sum_{w_1' \in S(w_1)} \frac{W(w_1, w_1')}{N(w_1)}\, P(w_2 \mid w_1')$$
where $W(w_1, w_1')$ is the similarity between $w_1$ and $w_1'$ and $N(w_1) = \sum_{w_1' \in S(w_1)} W(w_1, w_1')$ is a normalising factor. Table 2 shows the ten most similar adjectives to the word proud and the ten most similar nouns to the word chief using the Jensen-Shannon divergence and the confusion probability. Here the similarity function was calculated over the 1,000 most frequent adjectives in the BNC.</Paragraph>
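The weighted average described above can be sketched as follows. The function name, the callback signatures, and all numbers are hypothetical; the similarity callback stands in for whichever measure is used (Jensen-Shannon similarity or confusion probability), and the weights are normalised by their sum.

```python
def weighted_average_estimate(w1, w2, neighbours, similarity, cond_prob):
    """Similarity-weighted average of P(w2 | w1') over the neighbours w1' of w1."""
    weights = {n: similarity(w1, n) for n in neighbours}
    norm = sum(weights.values())
    if norm == 0:
        return 0.0  # no similar word carries any weight
    return sum(weights[n] * cond_prob(w2, n) for n in neighbours) / norm


# Toy data (invented): similarities of two adjectives to "proud", and the
# conditional probability of "chief" given each of those adjectives.
similarity_table = {("proud", "happy"): 0.6, ("proud", "angry"): 0.2}
cond_prob_table = {("chief", "happy"): 0.01, ("chief", "angry"): 0.03}

p_sim = weighted_average_estimate(
    "proud",
    "chief",
    ["happy", "angry"],
    similarity=lambda w, n: similarity_table.get((w, n), 0.0),
    cond_prob=lambda w2, n: cond_prob_table.get((w2, n), 0.0),
)
print(p_sim)
```

Smoothing over the noun instead of the adjective would use the same function with the roles of the two words exchanged.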
    </Section>
  </Section>
  <Section position="4" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3 Collecting Plausibility Ratings
</SectionTitle>
    <Paragraph position="0"> In order to evaluate the smoothing methods introduced above, we first needed to establish an independent measure of plausibility. The standard approach used in experimental psycholinguistics is to elicit judgements from human subjects; in this section we describe our method for assembling the set of experimental materials and collecting plausibility ratings for these stimuli.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Method
</SectionTitle>
      <Paragraph position="0"> Materials. We used a part-of-speech annotated, lemmatised version of the BNC. The BNC is a large, balanced corpus of British English, consisting of 90 million words of text and 10 million words of speech. (Footnote on the parameter settings of Section 2.2: These were shown to be the best parameter settings by Lapata (2000). Note that considerable latitude is available when setting these parameters; there are 151,478 distinct adjective types and 367,891 noun types in the BNC.)</Paragraph>
      <Paragraph position="1"> Adjective: Nouns
hungry: tradition, innovation, prey
guilty: system, wisdom, wartime
temporary: conception, surgery, statue
naughty: regime, rival, protocol</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
judgement experiment
</SectionTitle>
      <Paragraph position="0"> Frequency information obtained from the BNC can be expected to be a reasonable approximation of the language experience of a British English speaker.</Paragraph>
      <Paragraph position="1"> The experiment used the same set of 30 adjectives discussed in Lapata et al. (1999). These adjectives were chosen to be minimally ambiguous: each adjective had exactly two senses according to WordNet and was unambiguously tagged as 'adjective' 98.6% of the time, measured as the number of different part-of-speech tags assigned to the word in the BNC. For each adjective we obtained all the nouns (excluding proper nouns) with which it failed to co-occur in the BNC.</Paragraph>
      <Paragraph position="2"> We identified adjective-noun pairs by using Gsearch (Corley et al., 2001), a chart parser which detects syntactic patterns in a tagged corpus by exploiting a user-specified context free grammar and a syntactic query. From the syntactic analysis provided by the parser we extracted a table containing the adjective and the head of the noun phrase following it. In the case of compound nouns, we only included sequences of two nouns, and considered the rightmost occurring noun as the head. From the adjective-noun pairs obtained this way, we removed all pairs where the noun had a BNC frequency of less than 10 per million, in order to reduce the risk of plausibility ratings being influenced by the presence of a noun unfamiliar to the subjects. Each adjective was then paired with three randomly-chosen nouns from its list of non-co-occurring nouns. Example stimuli are shown in Table 3.</Paragraph>
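The selection of stimuli described above (discard nouns below 10 occurrences per million, keep only nouns that never co-occur with the adjective, then draw three at random) might look like the sketch below. The function name, the data structures, and the frequency values are assumptions made for illustration; only the threshold and the number of nouns per adjective come from the text.

```python
import random


def select_stimuli(adjective, nouns, seen_pairs, freq_per_million,
                   k=3, min_freq=10.0, seed=0):
    """Pick k random nouns that never co-occur with the adjective and are frequent enough."""
    candidates = sorted(
        noun for noun in nouns
        if (adjective, noun) not in seen_pairs
        and freq_per_million.get(noun, 0.0) >= min_freq
    )
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical corpus statistics.
nouns = ["tradition", "innovation", "prey", "dog", "qux"]
seen_pairs = {("hungry", "dog")}
freq_per_million = {"tradition": 40.0, "innovation": 25.0, "prey": 12.0,
                    "dog": 120.0, "qux": 0.3}

stimuli = select_stimuli("hungry", nouns, seen_pairs, freq_per_million)
print(stimuli)
```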
      <Paragraph position="3"> Procedure. The experimental paradigm was magnitude estimation (ME), a technique standardly used in psychophysics to measure judgements of sensory stimuli (Stevens, 1975), which Bard et al. (1996) and Cowart (1997) have applied to the elicitation of linguistic judgements. The ME procedure requires subjects to estimate the magnitude of physical stimuli by assigning numerical values proportional to the stimulus magnitude they perceive. In contrast to the 5- or 7-point scale conventionally used to measure human intuitions, ME employs an interval scale, and therefore produces data for which parametric inferential statistics are valid.</Paragraph>
      <Paragraph position="4"> ME requires subjects to assign numbers to a series of linguistic stimuli in a proportional fashion. Subjects are first exposed to a modulus item, to which they assign an arbitrary number. All other stimuli are rated proportional to the modulus. In this way, each subject can establish their own rating scale, thus yielding maximally fine-grained data and avoiding the known problems with the conventional ordinal scales for linguistic data (Bard et al., 1996; Cowart, 1997; Schütze, 1996).</Paragraph>
      <Paragraph position="5"> In the present experiment, subjects were presented with adjective-noun pairs and were asked to rate the degree of adjective-noun fit proportional to a modulus item. The experiment was carried out using WebExp, a set of Java-Classes for administering psycholinguistic studies over the World-Wide Web (Keller et al., 1998). Subjects first saw a set of instructions that explained the ME technique and included some examples, and had to fill in a short questionnaire including basic demographic information. Each subject saw the entire set of 90 experimental items.</Paragraph>
      <Paragraph position="6"> Subjects. Forty-one native speakers of English volunteered to participate. Subjects were recruited over the Internet by postings to relevant newsgroups and mailing lists.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> Correlation analysis was used to assess the degree of linear relationship between plausibility ratings (Plaus) and the three smoothed co-occurrence frequency estimates: distance-weighted averaging using Jensen-Shannon divergence (Jen), distance-weighted averaging using confusion probability (Conf), and class-based smoothing using WordNet (WN). For the two similarity-based measures, we smoothed either over the similarity of the adjective (subscript a) or over the similarity of the noun (subscript n). All frequency estimates were natural log-transformed.</Paragraph>
      <Paragraph position="1"> Table 4 displays the results of the correlation analysis. Mean plausibility ratings were significantly correlated with co-occurrence frequency recreated using our class-based smoothing method based on WordNet (r = .356, p &lt; .01).</Paragraph>
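A correlation analysis of this kind can be sketched as below: the recreated frequencies are natural log-transformed and then correlated with mean plausibility ratings using Pearson's r. The helper name and all data values are hypothetical, and significance testing is left out.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)  # undefined if either sequence is constant


# Invented recreated frequencies and mean plausibility ratings for five pairs.
recreated_freq = [6.12, 0.5, 20.0, 1.3, 3.7]
mean_plausibility = [0.4, -0.3, 0.9, 0.1, 0.2]

log_freq = [math.log(f) for f in recreated_freq]
r = pearson_r(log_freq, mean_plausibility)
print(round(r, 3))
```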
      <Paragraph position="2"> As detailed in Section 2.2, the Jensen-Shannon divergence and the confusion probability are parameterised measures. There are two ways to smooth the frequency of an adjective-noun combination: over the distribution of adjectives or over the distribution of nouns. We tried both approaches and found a moderate correlation between plausibility and the frequency recreated using distance-weighted averaging with the confusion probability. The correlation was significant both for frequencies recreated by smoothing over adjectives (r = .214, p &lt; .05) and over nouns (r = .232, p &lt; .05). However, co-occurrence frequency recreated using the Jensen-Shannon divergence was not reliably correlated with plausibility. Furthermore, there was a reliable correlation between the two Jensen-Shannon measures</Paragraph>
      <Paragraph position="4"> This indicates that the two similarity measures yield comparable results for the given task.</Paragraph>
      <Paragraph position="5"> We also examined the effect of varying one further parameter (see Section 2.2). The recreated frequencies were initially estimated using the n = 1,000 most similar items. We examined the effects of applying the two smoothing methods using a set of similar items of twice the size (n = 2,000). No improvement in terms of the correlations with rated plausibility was found when using this larger set, whether smoothing over the adjective or the noun: a moderate correlation with plausibility was found for Conf  was not significant.</Paragraph>
      <Paragraph position="6"> An important question is how well people agree in their plausibility judgements. Inter-subject agreement gives an upper bound for the task and allows us to interpret how well the smoothing techniques are doing in relation to the human judges. We computed the inter-subject correlation on the elicited judgements using leave-one-out resampling (Weiss and Kulikowski, 1991). Average inter-subject agreement was .55 (Min = .01, Max = .76, SD = .16). This means that our approach performs satisfactorily given that there is a fair amount of variability in human judgements of adjective-noun plausibility.</Paragraph>
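One common way to realise leave-one-out inter-subject agreement is to correlate each subject's ratings with the mean ratings of all remaining subjects and then summarise the resulting coefficients. The sketch below assumes that reading; the exact resampling scheme is not spelled out in the text, and the function names and ratings are invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_agreement(ratings):
    """Correlate each subject with the item-wise mean of all other subjects.

    ratings: one list of item ratings per subject, all in the same item order.
    """
    correlations = []
    for i, held_out in enumerate(ratings):
        others = [r for j, r in enumerate(ratings) if j != i]
        mean_others = [sum(col) / len(col) for col in zip(*others)]
        correlations.append(pearson_r(held_out, mean_others))
    return correlations


# Invented ratings: three subjects, four items.
ratings = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.5, 3.5, 5.0],
    [0.5, 2.5, 2.0, 3.0],
]
agreement = leave_one_out_agreement(ratings)
print(sum(agreement) / len(agreement), min(agreement), max(agreement))
```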
      <Paragraph position="7"> One remaining issue concerns the validity of our smoothing procedures. We have shown that co-occurrence frequencies recreated using smoothing techniques are significantly correlated with rated plausibility. But this finding constitutes only indirect evidence for the ability of this method to recreate corpus evidence; it depends on the assumption that plausibility and frequency are adequate indicators of each other's values. Does smoothing accurately recreate the co-occurrence frequency of combinations that actually do occur in the corpus? To address this question, we applied the class-based smoothing procedure to a set of adjective-noun pairs that occur in the corpus with varying frequencies, using the materials from Lapata et al. (1999).</Paragraph>
      <Paragraph position="8"> First, we removed all relevant adjective-noun combinations from the corpus. Effectively we assumed a linguistic environment with no evidence for the occurrence of the pair, and thus no evidence for any linguistic relationship between the adjective and the noun. Then we recreated the co-occurrence frequencies using class-based smoothing and distance-weighted averaging, and log-transformed the resulting frequencies. Both methods yielded reliable correlation between recreated frequency and actual BNC frequency (see Table 5 for details). This result provides additional evidence for the claim that these smoothing techniques produce reliable frequency estimates for unseen adjective-noun pairs. Note that the best correlations were achieved for Conf_a and Conf_n (r = .646, p &lt; .01 and r = .728, p &lt; .01, respectively).</Paragraph>
      <Paragraph position="9"> Finally, we carried out a further test of the quality of the recreated frequencies by correlating them with the plausibility judgements reported by Lapata et al. (1999). Again, a significant correlation was found for all methods (see Table 5). However, all correlations were lower than the correlation of the actual frequencies with plausibility (r = .570, p &lt; .01) reported by Lapata et al. (1999). Note also that the confusion probability outperformed Jensen-Shannon divergence, in line with our results on unfamiliar adjective-noun pairs.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Discussion
</SectionTitle>
      <Paragraph position="0"> Lapata et al. (1999) demonstrated that the co-occurrence frequency of an adjective-noun combination is the best predictor of its rated plausibility. The present experiment extended this result to adjective-noun pairs that do not co-occur in the corpus.</Paragraph>
      <Paragraph position="1"> We applied two smoothing techniques in order to recreate co-occurrence frequency and found that the class-based smoothing method was the best predictor of plausibility. This result is interesting because the class-based method does not use detailed knowledge about word-to-word relationships in real language; instead, it relies on the notion of equivalence classes derived from WordNet, a semantic taxonomy. It appears that making predictions about plausibility is most effectively done by collapsing together the speaker's experience with other words in the semantic class occupied by the target word.
Table 6 (the ten most distributionally similar words to the adjectives guilty and dangerous and the nouns stop and giant discovered by the Jensen-Shannon measure), listed per target word:
guilty: guilty, interested, innocent, injured, labour, socialist, strange, democratic, ruling, honest
dangerous: dangerous, certain, different, particular, difficult, other, strange, similar, various, bad
stop: stop, moon, employment, length, detail, page, time, potential, list, turn
giant: giant, company, manufacturer, artist, industry, firm, star, master, army, rival</Paragraph>
      <Paragraph position="2"> The distance-weighted averaging smoothing methods yielded a lower correlation with plausibility (in the case of the confusion probability), or no correlation at all (in the case of the Jensen-Shannon divergence). The worse performance of distance-weighted averaging is probably due to the fact that this method conflates two kinds of distributional similarity: on the one hand, it generates words that are semantically similar to the target word. On the other hand, it also generates words whose syntactic behaviour is similar to that of the target word. Rated plausibility, however, seems to be more sensitive to semantic than to syntactic similarity.</Paragraph>
      <Paragraph position="3"> As an example refer to Table 6, which displays the ten most distributionally similar words to the adjectives guilty and dangerous and to the nouns stop and giant discovered by the Jensen-Shannon measure. The set of similar words is far from semantically coherent. As far as the adjective guilty is concerned the measure discovered antonyms such as innocent and honest. Semantically unrelated adjectives such as injured, democratic, or interested are included; it seems that their syntactic behaviour is similar to that of guilty, e.g., they all co-occur with party. The same pattern can be observed for the adjective dangerous, to which none of the discovered adjectives are intuitively semantically related, perhaps with the exception of bad. The set of words most similar to the noun stop also does not appear to be semantically coherent. This problem with distance-weighted averaging is aggravated by the fact that the adjective or noun that we smooth over can be polysemous.</Paragraph>
      <Paragraph position="4"> Take the set of similar words for giant, for instance. The words company, manufacturer, industry and firm are similar to the 'enterprise' sense of giant, whereas artist, star, master are similar to the 'important/influential person' sense of giant. However, no similar word was found for either the 'beast' or 'heavyweight person' sense of giant. This illustrates that the distance-weighted averaging approach fails to take proper account of the polysemy of a word. The class-based approach, on the other hand, relies on WordNet, a lexical taxonomy that can be expected to cover most senses of a given lexical item.</Paragraph>
      <Paragraph position="5"> Recall that distance-weighted averaging discovers distributionally similar words by looking at simple lexical co-occurrence information. In the case of adjective-noun pairs we concentrated on combinations found in the corpus in a head-modifier relationship. This limited form of surface-syntactic information does not seem to be sufficient to reproduce the detailed knowledge that people have about the semantic relationships between words. Our class-based smoothing method, on the other hand, relies on the semantic taxonomy of WordNet, where fine-grained conceptual knowledge about words and their relations is encoded. This knowledge can be used to create semantically coherent equivalence classes.</Paragraph>
      <Paragraph position="6"> Such classes will not contain antonyms or items whose behaviour is syntactically related, but not semantically similar, to the words of interest.</Paragraph>
      <Paragraph position="7"> To summarise, it appears that distance-weighted averaging smoothing is only partially successful in reproducing the linguistic dependencies that characterise and constrain the formation of adjective-noun combinations. The class-based smoothing method, however, relies on a pre-defined taxonomy that allows these dependencies to be inferred, and thus reliably estimates the plausibility of adjective-noun combinations that fail to co-occur in the corpus.</Paragraph>
    </Section>
  </Section>
</Paper>