<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1625">
  <Title>Humor: Prosody Analysis and Automatic Recognition for F * R * I * E * N * D * S *</Title>
  <Section position="5" start_page="208" end_page="209" type="metho">
    <SectionTitle>
3 Audio Segmentation and Annotation
</SectionTitle>
    <Paragraph position="0"> We segmented each audio file (manually) by marking speaker turn boundaries, using Wavesurfer (http://www.speech.kth.se/wavesurfer). We apply a fairly straightforward annotation scheme to automatically identify humorous and non-humorous turns in our corpus. Speaker turns that are followed by artificial laughs are labeled as Humorous, and all the rest as Non-Humorous. For example, in the dialog excerpt shown in figure 1, turns 3, 7, 9, 11 and 16 are marked as humorous, whereas turns 1, 2, 5, 6, 13, 14, 15 are marked as non-humorous. Artificial laughs, silences longer than 1 second and segments of audio that contain purely non-verbal sounds (such as phone rings, door bells, music etc.) were excluded from the analysis. By considering only  [1] Rachel: Guess what? [2] Ross: You got a job? [3] Rachel: Are you kidding? I am trained for nothing! [4] BOLaughterBQ [5] Rachel: I was laughed out of twelve interviews today.</Paragraph>
    <Paragraph position="1"> [6] Chandler: And yet you're surprisingly upbeat. null [7] Rachel: You would be too if you found John and David boots on sale, fifty percent off! [8] BOLaughterBQ [9] Chandler: Oh, how well you know me...</Paragraph>
    <Paragraph position="2"> [10] BOLaughterBQ [11] Rachel: They are my new, I don't need a job, I don't need my parents, I got great boots, boots! [12] BOLaughterBQ [13] Monica: How'd you pay for them? [14] Rachel: Uh, credit card.</Paragraph>
    <Paragraph position="3"> [15] Monica: And who pays for that? [16] Rachel: Um... my... father.</Paragraph>
    <Paragraph position="4"> [17] BOLaughterBQ  speaker turns that are followed by laughs as humorous, we also automatically eliminate cases of pure visual comedy where humor is expressed using only gestures or facial expressions. In short, non-verbal sounds or silences followed by laughs are not treated as humorous. Henceforth, by turn, we mean proper speaker turns (and not non-verbal turns). We currently do not apply any special filters to remove non-verbal sounds or background noise (other than laughs) that overlap with speaker turns. However, if artificial laughs overlap with a speaker turn (there were only few such instances), the speaker turn is chopped by marking a turn boundary exactly before/after the laughs begin/end. This is to ensure that our prosody analysis is fair and does not catch any cues from the laughs. In other words, we make sure that our speaker turns are clean and not garbled by laughs. After segmentation, we got a total of 1629 speaker turns, of which 714 (43.8%) are humorous, and 915 (56.2%) are non-humorous. We also made sure that there is a 1-to-1 correspondence between speaker turns in text transcripts that were obtained online and our audio segments, and corrected few cases where there was a mis-match (due to turn-chopping or errors in online transcripts).</Paragraph>
  </Section>
  <Section position="6" start_page="209" end_page="210" type="metho">
    <SectionTitle>
4 Speaker Distributions
</SectionTitle>
    <Paragraph position="0"> There are 6 main actors/speakers (3 male and 3 female) in this show, along with a number of (in our data 26) guest actors who appear briefly and rarely in some of our dialogs. As the number of guest actors is quite large, and their individual contribution is less than 5% of the turns in our data, we decided to group all the guest actors together in one GUEST class.</Paragraph>
    <Paragraph position="1"> As these are acted (not real) conversations, there were only few instances of speaker turnoverlaps, where multiple speakers speak together. These turns were given a speaker label MULTI. Table 1 shows the total number of turns and humorous turns for each speaker, along with their percentages in braces. Percentages for the Humor column show, out of the total (714) humorous turns, how many are by each speaker. As one can notice, the distribution of turns is fairly balanced among the six main speakers. We also notice that even though each guest actors' individual contribution is less than 5% in our data, their combined contribution is fairly large, almost 16% of the total turns. Table 2 shows that the six main actors together form a total of 83% of our data. Also, of the total 714 humorous turns, 615 (86%) turns are by the main actors. To study if prosody of humor differs across males and females, we also grouped the main actors into two gender classes. Table 2 shows that the gender distribution is fairly bal- null anced among the main actors, with 50.5% male and 49.5% female turns. We also see that of the 685 male turns, 347 turns (almost 50%) are humorous, and of the 672 female turns, 268 (approximately 40%) are humorous. Guest actors and multi-speaker turns are not considered in the gender analysis.</Paragraph>
  </Section>
  <Section position="7" start_page="210" end_page="210" type="metho">
    <SectionTitle>
5 Features
</SectionTitle>
    <Paragraph position="0"> Literature in emotional speech analysis (Liscombe et al., 2003)(Litman and Forbes-Riley, 2004) (Scherer, 2003)(Ang et al., 2002) has shown that prosodic features such as pitch, energy, speaking rate (tempo) are useful indicators of emotional states, such as joy, anger, fear, boredom etc. While humor is not necessarily considered as an emotional state, we noticed that most humorous utterances in our corpus (and also in general) often make use of hyper-articulations, similar to those found in emotional speech.</Paragraph>
    <Paragraph position="1"> For this study, we use a number of acoustic-prosodic as well as some non acoustic-prosodic features as listed below:  Our acoustic-prosodic features make use of the pitch, energy and temporal information in the speech signal, and are computed using Wavesurfer. Figure 2 shows Wavesurfer's energy (dB), pitch (Hz), and transcription (.lab) panes. The transcription interface shows text corresponding to the dialog turns, along with the turn boundaries. All features are computed at the turn level, and essentially measure the mean, maximum, minimum, range (maximum-minimum) and standard deviation of the feature value (F0 or RMS) over the entire turn (ignoring zeroes). Duration is measured in terms of time in seconds, from the beginning to the end of the turn including pauses (if any) in between. Internal silence is measured as the percentage of zero F0 frames, and essentially account for the amount of silence in the turn. Tempo is computed as the total number of syllables divided by the duration of the turn. For computing the number of syllables per word, we used the General Inquirer database (Stone et al., 1966).</Paragraph>
    <Paragraph position="2"> Our lexical features are simply all words (alphanumeric strings including apostrophes and stopwords) in the turn. The value of these features is integral and essentially counts the number of times a word is repeated in the turn. Although this indirectly accounts for alliterations, in the future studies, we plan to use more stylistic lexical features like (Mihalcea and Strapparava, 2005).</Paragraph>
    <Paragraph position="3"> Turn length is measured as the number of words in the turn. For our classification study, we consider eight speaker classes (6 Main actors, 1 for Guest and Multi) as shown in table 1, whereas for the gender study, we consider only two speaker categories (male and female) as shown in table 2.</Paragraph>
  </Section>
  <Section position="8" start_page="210" end_page="211" type="metho">
    <SectionTitle>
6 Humor-Prosody Analysis
</SectionTitle>
    <Paragraph position="0"> Humor and Non-Humor groups Table 3 shows mean values of various acoustic-prosodic features over all speaker turns in our data, across humor and non-humor groups. Features that have statistically (pBO=0.05 as per independent samples t-test) different values across the two groups are marked with asterisks. As one can see, all features except Mean-F0 and StdDev-F0 show significant differences across humorous and non-humorous speech. Table 3 shows that humorous turns in our data are longer, both in terms of the time duration and the number of words, than non-humorous turns. We also notice that humorous turns have smaller internal silence, and hence rapid tempo. Pitch (F0) and energy (RMS) features have higher maximum, but lower minimum  values, for humorous turns. This in turn gives higher values for range and standard deviation for humor compared to the non-humor group. This result is somewhat consistent with previous findings of (Liscombe et al., 2003) who found that most of these features are largely associated with positive and active emotional states such as happy, encouraging, confident etc. which are likely to appear in our humorous turns.</Paragraph>
  </Section>
  <Section position="9" start_page="211" end_page="211" type="metho">
    <SectionTitle>
7 Gender Effect on Humor-Prosody
</SectionTitle>
    <Paragraph position="0"> To analyze prosody of humor across two genders, we conducted a 2-way ANOVA test, using speaker gender (male/female) and humor (yes/no) as our fixed factors, and each of the above acoustic-prosodic features as a dependent variable. The test tells us the effect of humor on prosody adjusted for gender, the effect of gender on prosody adjusted for humor and also the effect of interaction between gender and humor on prosody (i.e.</Paragraph>
    <Paragraph position="1"> if the effect of humor on prosody differs according to gender). Table 4 shows results of 2-way ANOVA, where Y shows significant effects, and N shows non-significant effects. For example, the result for tempo shows that tempo differs significantly only across humor and non-humor groups, but not across the two gender groups, and that there is no effect of interaction between humor and gender on tempo. As before, all features except Mean-F0 and StdDev-F0 show significant differences across humor and no-humor conditions, even when adjusted for gender differences. The table also shows that all features except internal silence and tempo show significant differences across two genders, although only pitch features (Max-F0, Min-F0, and StdDev-F0) show the effect of interaction between gender and humor. In other words, the effect of humor on these pitch features is dependent on gender. For instance, if male speakers raise their pitch while expressing humor, female speakers might lower. To confirm this, we computed means values of various features for males and females separately (See Tables 5 and 6). These tables indeed suggest that male speakers show higher values for pitch features (Mean-F0, Min-F0, StdDev-F0), while expressing humor, whereas females show lower. Also for male speakers, differences in Min-F0 and Min-RMS values are not statistically significant across humor and non-humor groups, whereas for female speakers, features Mean-F0, StdDev-F0 and tempo do not show significant differences across the two groups.</Paragraph>
    <Paragraph position="2"> One can also notice that the differences in the mean pitch feature values (specifically Mean-F0, Max-F0 and Range-F0) between humor and non-humor groups are much higher for males than for females.</Paragraph>
    <Paragraph position="3"> In summary, our gender analysis shows that although most acoustic-prosodic features are different for males and females, the prosodic style of expressing humor by male and female speakers differs only along some pitch-features (both in magnitude and direction).</Paragraph>
  </Section>
  <Section position="10" start_page="211" end_page="212" type="metho">
    <SectionTitle>
8 Speaker Effect on Humor-Prosody
</SectionTitle>
    <Paragraph position="0"> We then conducted similar ANOVA test to account for the speaker differences, i.e. by considering humor (yes/no) and speaker (8 groups as shown in table 1) as our fixed factors and each of the acoustic-prosodic features as a dependent variable for a 2-Way ANOVA. Table 7 shows results of this analysis. As before, the table shows the effect of humor adjusted for speaker, the effect of speaker adjusted for humor and also the effect of interaction between humor and speaker, on each of the acoustic-prosodic features. According to table 7, we no longer see the effect of humor on features Min-F0, Mean-RMS and Tempo (in addition to Mean-F0 and StdDev-F0), in presence of the speaker variable. Speaker, on the other hand, shows significant effect on prosody for all features. But  surprisingly, again only pitch features Mean-F0, Max-F0 and Min-F0 show the interaction effect, suggesting that the effect of humor on these pitch features differs from speaker to speaker. In other words, different speakers use different pitch variations while expressing humor.</Paragraph>
  </Section>
  <Section position="11" start_page="212" end_page="213" type="metho">
    <SectionTitle>
9 Humor Recognition by Supervised Learning
</SectionTitle>
      <Paragraph position="0"> We formulate our humor-recognition experiment as a classical supervised learning problem, by automatically classifying spoken turns into humor and non-humor groups, using standard machine learning classifiers. We used the decision tree algorithm ADTree from Weka, and ran a 10-fold cross validation experiment on all 1629 turns in our data1. The baseline for these experiments is 56.2% for the majority class (nonhumorous). Table 8 reports classification results for six feature categories: lexical alone, lexical + speaker, prosody alone, prosody + speaker, lexical + prosody and lexical + prosody + speaker (all).</Paragraph>
      <Paragraph position="1"> Numbers in braces show the number of features in each category. There are total 2025 features which include 2011 lexical (all word types plus turn length), 13 acoustic-prosodic and 1 for the speaker information. Feature Length was included in the lexical feature group, as it counts the number of lexical items (words) in the turn.</Paragraph>
      <Paragraph position="2">  All results are significantly above the baseline (as measured by a pair-wise t-test) with the best accuracy of 64% (8% over the baseline) obtained using all features. We notice that the classification accuracy improves on adding speaker information to both lexical and prosodic features. Although these results do not show a strong evidence that prosodic features are better than lexical, it is interesting to note that the performance of just a few (13) prosodic features is comparable to that of 2011 lexical features. Figure 3 shows the decision tree produced by the classifier in 10 iterations. Numbers indicate the order in which the nodes are created, and indentations mark parent-child relations. We notice that the classifier primarily selected speaker and prosodic features in the first 10 iterations, whereas lexical features were selected only in the later iterations (not shown here). This seems consistent with our original hypothesis that speech features are better at discriminating between humorous and non-humorous utterances in speech than lexical content.</Paragraph>
      <Paragraph position="3"> Although (Mihalcea and Strapparava, 2005) obtained much higher accuracies using lexical features alone, it might be due to the fact that our data is homogeneous in the sense that both humorous and non-humorous turns are extracted from the same source, and involve same speakers, which makes the two groups highly alike and hence challenging to distinguish. To make sure that the lower accuracy we get is not simply due to using smaller data compared to (Mihalcea and Strappar- null ava, 2005), we looked at the learning curve for the classifier (see figure 4) and found that the classifier performance is not sensitive to the amount of data.</Paragraph>
      <Paragraph position="4"> Table 9 shows classification results by gender, using all features. For the male group, the base-line is 50.6%, as the majority class humor is 50.6% (See Table 2). For females, the baseline is 60% (for non-humorous) as only 40% of the female turns are humorous.</Paragraph>
      <Paragraph position="5">  As Table 9 shows, the performance of the classifier is somewhat consistent cross-gender, although for male speakers, the relative improvement is much higher (14% above the baseline), than for females (only 5% above the baseline). Our earlier observation (from tables 5 and 6) that differences in pitch features between humor and non-humor  are shown) groups are quite higher for males than for females, may explain why we see higher improvement for male speakers.</Paragraph>
  </Section>
class="xml-element"></Paper>