<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1067">
  <Title>Making Computers Laugh: Investigations in Automatic Humor Recognition</Title>
  <Section position="4" start_page="533" end_page="534" type="metho">
    <SectionTitle>
3 Automatic Humor Recognition
</SectionTitle>
    <Paragraph position="0"> We experiment with automatic classification techniques using: (a) heuristics based on humor-specific stylistic features (alliteration, antonymy, slang); (b) content-based features, within a learning framework formulated as a typical text classification task; and (c) combined stylistic and content-based features, integrated in a stacked machine learning framework.</Paragraph>
    <Section position="1" start_page="533" end_page="534" type="sub_section">
      <SectionTitle>
3.1 Humor-Specific Stylistic Features
</SectionTitle>
      <Paragraph position="0"> Linguistic theories of humor (Attardo, 1994) have suggested many stylistic features that characterize humorous texts. We tried to identify a set of features that were both significant and feasible to implement using existing machine readable resources.</Paragraph>
      <Paragraph position="1"> Specifically, we focus on alliteration, antonymy, and adult slang, which were previously suggested as potentially good indicators of humor (Ruch, 2002; Bucaria, 2004).</Paragraph>
      <Paragraph position="2"> Alliteration. Some studies on humor appreciation (Ruch, 2002) show that structural and phonetic properties of jokes are at least as important as their content. In fact one-liners often rely on the reader's awareness of attention-catching sounds, through linguistic phenomena such as alliteration, word repetition and rhyme, which produce a comic effect even if the jokes are not necessarily meant to be read aloud.</Paragraph>
      <Paragraph position="3"> Note that similar rhetorical devices play an important role in wordplay jokes, and are often used in newspaper headlines and in advertisement. The following one-liners are examples of jokes that include one or more alliteration chains: Veni, Vidi, Visa: I came, I saw, I did a little shopping. Infants don't enjoy infancy like adults do adultery.</Paragraph>
      <Paragraph position="4"> To extract this feature, we identify and count the number of alliteration/rhyme chains in each example in our data set. The chains are automatically extracted using an index created on top of the CMU pronunciation dictionary2.</Paragraph>
      <Paragraph position="5"> Antonymy. Humor often relies on some type of incongruity, opposition or other forms of apparent contradiction. While an accurate identification of all these properties is probably difficult to accomplish, it is relatively easy to identify the presence of antonyms in a sentence. For instance, the comic effect produced by the following one-liners is partly due to the presence of antonyms: A clean desk is a sign of a cluttered desk drawer.</Paragraph>
      <Paragraph position="6"> Always try to be modest and be proud of it! The lexical resource we use to identify antonyms is WORDNET (Miller, 1995), and in particular the antonymy relation among nouns, verbs, adjectives and adverbs. For adjectives we also consider an indirect antonymy via the similar-to relation among adjective synsets. Despite the relatively large number of antonymy relations defined in WORDNET, its coverage is far from complete, and thus the antonymy feature cannot always be identified. A deeper semantic analysis of the text, such as word sense disambiguation or domain disambiguation, could probably help detecting other types of semantic opposition, and we plan to exploit these techniques in future work.</Paragraph>
      <Paragraph position="7"> Adult slang. Humor based on adult slang is very popular. Therefore, a possible feature for humor-recognition is the detection of sexual-oriented lexicon in the sentence. The following represent examples of one-liners that include such slang: The sex was so good that even the neighbors had a cigarette. Artificial Insemination: procreation without recreation. To form a lexicon required for the identification of this feature, we extract from WORDNET DOMAINS3 all the synsets labeled with the domain SEXUALITY.</Paragraph>
      <Paragraph position="8"> The list is further processed by removing all words with high polysemy ([?] 4). Next, we check for the presence of the words in this lexicon in each sentence in the corpus, and annotate them accordingly.</Paragraph>
      <Paragraph position="9"> Note that, as in the case of antonymy, WORDNET coverage is not complete, and the adult slang feature cannot always be identified.</Paragraph>
      <Paragraph position="10"> Finally, in some cases, all three features (alliteration,  antonymy, adult slang) are present in the same sentence, as for instance the following one-liner: Behind every greatal manant is a greatal womanant, and behind every greatal womanant is some guy staring at her behindsl!</Paragraph>
    </Section>
    <Section position="2" start_page="534" end_page="534" type="sub_section">
      <SectionTitle>
3.2 Content-based Learning
</SectionTitle>
      <Paragraph position="0"> In addition to stylistic features, we also experimented with content-based features, through experiments where the humor-recognition task is formulated as a traditional text classification problem.</Paragraph>
      <Paragraph position="1"> Specifically, we compare results obtained with two frequently used text classifiers, Na&amp;quot;ive Bayes and Support Vector Machines, selected based on their performance in previously reported work, and for their diversity of learning methodologies.</Paragraph>
      <Paragraph position="2"> Na&amp;quot;ive Bayes. The main idea in a Na&amp;quot;ive Bayes text classifier is to estimate the probability of a category given a document using joint probabilities of words and documents. Na&amp;quot;ive Bayes classifiers assume word independence, but despite this simplification, they perform well on text classification. While there are several versions of Na&amp;quot;ive Bayes classifiers (variations of multinomial and multivariate Bernoulli), we use the multinomial model, previously shown to be more effective (McCallum and Nigam, 1998).</Paragraph>
      <Paragraph position="3"> Support Vector Machines. Support Vector Machines (SVM) are binary classifiers that seek to find the hyperplane that best separates a set of positive examples from a set of negative examples, with maximum margin. Applications of SVM classifiers to text categorization led to some of the best results reported in the literature (Joachims, 1998).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="534" end_page="536" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> Several experiments were conducted to gain insights into various aspects related to an automatic humor recognition task: classification accuracy using stylistic and content-based features, learning rates, impact of the type of negative data, impact of the classification methodology.</Paragraph>
    <Paragraph position="1"> All evaluations are performed using stratified ten-fold cross validations, for accurate estimates. The baseline for all the experiments is 50%, which represents the classification accuracy obtained if a label of &amp;quot;humorous&amp;quot; (or &amp;quot;non-humorous&amp;quot;) would be assigned by default to all the examples in the data set.</Paragraph>
    <Paragraph position="2"> Experiments with uneven class distributions were also performed, and are reported in section 4.4.</Paragraph>
    <Section position="1" start_page="534" end_page="534" type="sub_section">
      <SectionTitle>
4.1 Heuristics using Humor-specific Features
</SectionTitle>
      <Paragraph position="0"> In a first set of experiments, we evaluated the classification accuracy using stylistic humor-specific features: alliteration, antonymy, and adult slang. These are numerical features that act as heuristics, and the only parameter required for their application is a threshold indicating the minimum value admitted for a statement to be classified as humorous (or nonhumorous). These thresholds are learned automatically using a decision tree applied on a small subset of humorous/non-humorous examples (1000 examples). The evaluation is performed on the remaining 15,000 examples, with results shown in Table 24.</Paragraph>
      <Paragraph position="1">  tion, antonymy, and adult slang.</Paragraph>
      <Paragraph position="2"> Considering the fact that these features represent stylistic indicators, the style of Reuters titles turns out to be the most different with respect to oneliners, while the style of proverbs is the most similar. Note that for all data sets the alliteration feature appears to be the most useful indicator of humor, which is in agreement with previous linguistic findings (Ruch, 2002).</Paragraph>
    </Section>
    <Section position="2" start_page="534" end_page="535" type="sub_section">
      <SectionTitle>
4.2 Text Classification with Content Features
</SectionTitle>
      <Paragraph position="0"> The second set of experiments was concerned with the evaluation of content-based features for humor recognition. Table 3 shows results obtained using the three different sets of negative examples, with the Na&amp;quot;ive Bayes and SVM text classifiers. Learning curves are plotted in Figure 2.</Paragraph>
      <Paragraph position="1">  different sets of negative examples: (a) Reuters; (b) BNC; (c) Proverbs. Once again, the content of Reuters titles appears to be the most different with respect to one-liners, while the BNC sentences represent the most similar data set. This suggests that joke content tends to be very similar to regular text, although a reasonably accurate distinction can still be made using text classification techniques. Interestingly, proverbs can be distinguished from one-liners using content-based features, which indicates that despite their stylistic similarity (see Table 2), proverbs and one-liners deal with different topics.</Paragraph>
    </Section>
    <Section position="3" start_page="535" end_page="535" type="sub_section">
      <SectionTitle>
4.3 Combining Stylistic and Content Features
</SectionTitle>
      <Paragraph position="0"> Encouraged by the results obtained in the first two experiments, we designed a third experiment that attempts to jointly exploit stylistic and content features for humor recognition. The feature combination is performed using a stacked learner, which takes the output of the text classifier, joins it with the three humor-specific features (alliteration, antonymy, adult slang), and feeds the newly created feature vectors to a machine learning tool. Given the relatively large gap between the performance achieved with content-based features (text classification) and stylistic features (humor-specific heuristics), we decided to implement the second learning stage in the stacked learner using a memory based learning system, so that low-performance features are not eliminated in the favor of the more accurate ones5. We use the Timbl memory based learner (Daelemans et al., 2001), and evaluate the classification using a stratified ten-fold cross validation. Table  learning based on stylistic and content features.</Paragraph>
      <Paragraph position="1"> Combining classifiers results in a statistically significant improvement (p &lt; 0.0005, paired t-test) with respect to the best individual classifier for the One-liners/Reuters and One-liners/BNC data sets, with relative error rate reductions of 8.9% and 7.3% respectively. No improvement is observed for the One-liners/Proverbs data set, which is not surprising since, as shown in Table 2, proverbs and one-liners cannot be clearly differentiated using stylistic features, and thus the addition of these features to content-based features is not likely to result in an improvement.</Paragraph>
    </Section>
    <Section position="4" start_page="535" end_page="536" type="sub_section">
      <SectionTitle>
4.4 Discussion
</SectionTitle>
      <Paragraph position="0"> The results obtained in the automatic classification experiments reveal the fact that computational approaches represent a viable solution for the task of humor-recognition, and good performance can be achieved using classification techniques based on stylistic and content features.</Paragraph>
      <Paragraph position="1"> Despite our initial intuition that one-liners are most similar to other creative texts (e.g. Reuters titles, or the sometimes almost identical proverbs), and thus the learning task would be more difficult in relation to these data sets, comparative experimental results show that in fact it is more difficult to distinguish humor with respect to regular text (e.g. BNC  sentences). Note however that even in this case the combined classifier leads to a classification accuracy that improves significantly over the apriori known baseline.</Paragraph>
      <Paragraph position="2"> An examination of the content-based features learned during the classification process reveals interesting aspects of the humorous texts. For instance, one-liners seem to constantly make reference to human-related scenarios, through the frequent use of words such as man, woman, person, you, I. Similarly, humorous texts seem to often include negative word forms, such as the negative verb forms doesn't, isn't, don't, or negative adjectives like wrong or bad. A more extensive analysis of content-based humor-specific features is likely to reveal additional humor-specific content features, which could also be used in studies of humor generation.</Paragraph>
      <Paragraph position="3"> In addition to the three negative data sets, we also performed an experiment using a corpus of arbitrary sentences randomly drawn from the three negative sets. The humor recognition with respect to this negative mixed data set resulted in 63.76% accuracy for stylistic features, 77.82% for content-based features using Na&amp;quot;ive Bayes and 79.23% using SVM. These figures are comparable to those reported in Tables 2 and 3 for One-liners/BNC, which suggests that the experimental results reported in the previous sections do not reflect a bias introduced by the negative data sets, since similar results are obtained when the humor recognition is performed with respect to arbitrary negative examples.</Paragraph>
      <Paragraph position="4"> As indicated in section 2.2, the negative examples were selected structurally and stylistically similar to the one-liners, making the humor recognition task more difficult than in a real setting. Nonetheless, we also performed a set of experiments where we made the task even harder, using uneven class distributions. For each of the three types of negative examples, we constructed a data set using 75% non-humorous examples and 25% humorous examples. Although the baseline in this case is higher (75%), the automatic classification techniques for humor-recognition still improve over this baseline.</Paragraph>
      <Paragraph position="5"> The stylistic features lead to a classification accuracy of 87.49% (One-liners/Reuters), 77.62% (Oneliners/BNC), and 76.20% (One-liners/Proverbs), and the content-based features used in a Na&amp;quot;ive Bayes classifier result in accuracy figures of 96.19% (One-liners/Reuters), 81.56% (One-liners/BNC), and 87.86% (One-liners/Proverbs).</Paragraph>
      <Paragraph position="6"> Finally, in addition to classification accuracy, we were also interested in the variation of classification performance with respect to data size, which is an aspect particularly relevant for directing future research. Depending on the shape of the learning curves, one could decide to concentrate future work either on the acquisition of larger data sets, or toward the identification of more sophisticated features. Figure 2 shows that regardless of the type of negative data, there is significant learning only until about 60% of the data (i.e. about 10,000 positive examples, and the same number of negative examples). The rather steep ascent of the curve, especially in the first part of the learning, suggests that humorous and non-humorous texts represent well distinguishable types of data. An interesting effect can be noticed toward the end of the learning, where for both classifiers the curve becomes completely flat (One-liners/Reuters, One-liners/Proverbs), or it even has a slight drop (One-liners/BNC). This is probably due to the presence of noise in the data set, which starts to become visible for very large data sets6.</Paragraph>
      <Paragraph position="7"> This plateau is also suggesting that more data is not likely to help improve the quality of an automatic humor-recognizer, and more sophisticated features are probably required.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>