<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1054"> <Title>A Quantitative Analysis of Lexical Differences Between Genders in Telephone Conversations</Title> <Section position="6" start_page="436" end_page="440" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> Having explained the methods and data that we have used, we now set out to investigate a number of research questions concerning the nature of the differences between genders. Each subsection is concerned with a single question.</Paragraph> <Section position="1" start_page="436" end_page="437" type="sub_section"> <SectionTitle> 4.1 Given only the transcript of a conversation, is it possible to classify conversation sides according to the gender of the speaker? </SectionTitle> <Paragraph position="0"> The first hypothesis we investigate is whether simple features, such as counts of individual terms (unigrams) or pairs of terms (bigrams), have different distributions between genders. The set of possible terms consists of all words in the Fisher corpus plus some non-lexical tokens such as laughter and filled pauses. One way to assess the difference in their distributions is to attempt to classify conversation sides according to the gender of the speaker. The results are shown in Table 1, where a number of different text classification algorithms were applied to classify conversation sides. 14969 conversation sides are used for training and 3738 sides for testing. No feature selection was performed; in all classifiers a vocabulary of all unigrams or bigrams with 5 or more occurrences is used (20513 for unigrams, 306779 for bigrams). For all algorithms except Naive Bayes, we used the tf*idf representation. The Rainbow toolkit (McCallum, 1996) was used for training the classifiers. The results show that the differences between genders are clear, and the best results are obtained with SVMs. The fact that classification performance is significantly above chance for a variety of learning methods shows that lexical differences between genders are inherent in the data and not an artifact of a specific choice of classifier.</Paragraph> <Paragraph position="1"> From Table 1 we also observe that using bigrams is consistently better than using unigrams, despite the fact that the number of unique terms rises from ≈20K to ≈300K. This suggests that gender differences become even more pronounced for phrases, a finding similar to that of (Doddington, 2001) for speaker differences.</Paragraph> </Section>
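As a concrete illustration of this setup, the sketch below wires together tf*idf features with a linear SVM. The paper used the Rainbow toolkit; scikit-learn is substituted here purely for illustration, and transcript loading is left to the reader, so this is a minimal sketch of the recipe rather than the authors' implementation.

```python
# Minimal sketch of the gender classification experiment: tf*idf features
# with a frequency cutoff, plus a linear SVM. scikit-learn stands in for
# the Rainbow toolkit used in the paper; data loading is assumed done.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_gender_classifier(train_texts, train_labels, ngram=(1, 1)):
    """train_texts: one transcript string per conversation side;
    train_labels: 'F' or 'M' for each side. Use ngram=(2, 2) for the
    bigram condition reported in Table 1."""
    # min_df=5 approximates the 5-occurrence vocabulary cutoff described
    # above (it counts documents, not raw occurrences).
    vectorizer = TfidfVectorizer(ngram_range=ngram, min_df=5)
    X = vectorizer.fit_transform(train_texts)  # no feature selection
    classifier = LinearSVC().fit(X, train_labels)
    return vectorizer, classifier

def predict_gender(vectorizer, classifier, texts):
    return classifier.predict(vectorizer.transform(texts))
```

Note that min_df counts documents rather than raw occurrences, so it only approximates the cutoff described above; Naive Bayes, which in the paper did not use tf*idf, would operate on raw counts instead.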
<Section position="2" start_page="437" end_page="438" type="sub_section"> <SectionTitle> 4.2 Does the gender of a conversation side influence lexical usage of the other conversation side? </SectionTitle> <Paragraph position="0"> Each conversation always consists of two people talking to each other. Up to this point, we have only attempted to analyze a conversation side in isolation, i.e. without using transcriptions from the other side. In this subsection, we attempt to assess the degree to which, if any, the gender of one speaker influences the language of the other speaker. In the first experiment, instead of defining two categories we define four: the Cartesian product of the gender of the current speaker and the gender of the other speaker. These categories are denoted with two letters, the first characterizing the gender of the current speaker and the second the gender of the other speaker: FF, FM, MF, MM. The task remains the same: given the transcript of a conversation side, classify it according to the appropriate category. This task is much harder than the binary classification of subsection 4.1, because given only the transcript of a conversation side we must make inferences about the gender of the current as well as the other conversation side. We used SVMs as the learning method. In their basic formulation, SVMs are binary classifiers (although there has been recent work on multi-class SVMs). We followed the original binary formulation and converted the 4-class problem into six 2-class problems. The final decision is made by voting among the individual classifiers.</Paragraph>
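A minimal sketch of this decomposition is given below: one binary SVM per pair of the four categories (six pairs in total), with the final label chosen by majority vote. It assumes feature matrices built as in subsection 4.1 and illustrates the scheme rather than reproducing the authors' code; scikit-learn's OneVsOneClassifier packages the same strategy.

```python
# Sketch of the 4-class-to-6-pairwise decomposition with majority voting.
# X: feature matrix of conversation sides; y: array of labels in
# {FF, FM, MF, MM}.
from collections import Counter
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

CATEGORIES = ["FF", "FM", "MF", "MM"]

def train_pairwise(X, y):
    """Train one binary classifier per category pair (6 pairs)."""
    models = {}
    for a, b in combinations(CATEGORIES, 2):
        idx = np.flatnonzero(np.isin(y, [a, b]))  # rows of the two classes
        models[(a, b)] = LinearSVC().fit(X[idx], y[idx])
    return models

def predict_voting(models, X):
    """Each pairwise model votes on every sample; most votes wins."""
    votes = [clf.predict(X) for clf in models.values()]  # 6 x n_samples
    return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]
```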
<Paragraph position="1"> The confusion matrix of the 4-way classification is shown in Table 2.</Paragraph> <Paragraph position="2"> Table 2: Classification of gender of both sides using transcripts from one side. Unigrams are used as features, SVMs as the classification method. Each row represents the true category and each column the hypothesized category.</Paragraph> <Paragraph position="3"> The results show that although two of the four categories, FF and MM, are quite robustly detected, the other two, FM and MF, are mostly confused with FF and MM respectively. These results can be mapped to single-gender detection, giving an accuracy of 85.9% for classifying the gender of the given transcript (as in Table 1) and 68.5% for classifying the gender of the conversational partner. The accuracy of 68.5% is higher than chance (57.8%), showing that speakers alter their linguistic patterns depending on the gender of their conversational partner.</Paragraph> <Paragraph position="4"> In the next experiment we design two binary classifiers. In the first, the task is to correctly classify FF vs. MM transcripts; in the second, FM vs. MF transcripts. Therefore, we attempt to classify the gender of a speaker given knowledge of whether the conversation is same-gender or cross-gender. For both classifiers 4526 sides were used for training, equally divided among the classes. 2558 sides were used for testing the FF-MM classifier and 1180 sides for the FM-MF classifier. The results are shown in Table 3.</Paragraph> <Paragraph position="5"> Table 3: Classification accuracy for same-gender and cross-gender conversations. SVMs are used as the classification method; no feature selection is applied.</Paragraph> <Paragraph position="6"> It is clear from Table 3 that there is a significant difference in performance between the FF-MM and FM-MF classifiers, suggesting that people alter their linguistic patterns depending on the gender of the person they are talking to. In same-gender conversations almost perfect accuracy is reached, indicating that the linguistic patterns of the two genders become very distinct. In cross-gender conversations the differences become less prominent, since classification accuracy drops compared to same-gender conversations. This result, however, does not reveal how this convergence of linguistic patterns is achieved. Is the convergence attributable to one of the genders, for example males attempting to match the patterns of females, or is it collectively constructed? To answer this question, we can examine the classification performance of two other binary classifiers: FF vs. FM and MM vs. MF. The results are shown in Table 4. In both classifiers 4608 conversation sides are used for training, equally divided between the classes. The number of sides used for testing is 989 for the FF-FM classifier and 689 for the MM-MF classifier.</Paragraph> <Paragraph position="7"> Table 4: Classification accuracy for detecting the gender of speaker B given only the transcript of speaker A. SVMs are used as the classification method; no feature selection is applied.</Paragraph> <Paragraph position="8"> The results in Table 4 suggest that both genders equally alter their linguistic patterns to match the opposite gender. It is interesting to see that the gender of speaker B can be detected better than chance given only the transcript and gender of speaker A. The results are better than chance at the 0.0005 significance level.</Paragraph> </Section> <Section position="3" start_page="438" end_page="439" type="sub_section"> <SectionTitle> 4.3 Are some features more indicative of gender than others? </SectionTitle> <Paragraph position="0"> Having shown that gender lexical differences are prominent enough to classify each speaker according to gender quite robustly, another question is whether the high classification accuracies can be attributed to a small number of features or are rather the cumulative effect of a large number of them. In Table 5 we apply the two feature selection criteria described in Section 3.</Paragraph> <Paragraph position="1"> The results of Table 5 show that lexical differences between genders are not isolated in a small set of words. The best results are achieved with 40% (IG) and 70% (KL) of the features; using fewer features steadily degrades performance. Using the 5000 least discriminative unigrams and Naive Bayes as the classification method resulted in 58.4% classification accuracy, which is not statistically better than chance (this is the test set of Tables 1 and 2, not of Table 4). Using the 15000 least useful unigrams resulted in a classification accuracy of 66.4%, which shows that the number of irrelevant features is rather small, about 5K features.</Paragraph> <Paragraph position="2"> It is also instructive to see which features are most discriminative for each gender. The features that, when present, are most indicative of each gender (positive features) are shown in Table 6. They are sorted using the KL distance, dropping the summation over both genders in equation (2). Looking at the top 2000 features for each gender, we observed that a number of swear words appear among the most discriminative features for males, while family-relation terms are often associated with females. For example, the following words are in the top 2000 (out of 20513) most useful features for males: shit, bullshit, shitty, fuck, fucking, fucked, bitching, bastards, ass, asshole, sucks, sucked, suck, sucker, damn, goddamn, damned. The following words are in the top 2000 features for females: children, grandchild, child, grandchildren, childhood, childbirth, kids, grandkids, son, grandson, daughter, granddaughter, boyfriend, marriage, mother, grandmother.</Paragraph> <Paragraph position="3"> Table 6: Most discriminative features for each gender according to KL distance. Words higher in the list are more discriminative.</Paragraph> <Paragraph position="4"> It is also interesting to note that a number of non-lexical tokens are strongly associated with a certain gender. For example, [laughter] and acknowledgments/backchannels such as uh-huh, uhuh were in the top 2000 features for females. On the other hand, filled pauses such as uh were strong male indicators. Our analysis also reveals that a high number of useful features are names. A possible explanation is that people usually introduce themselves at the beginning of the conversation. In the top 30 words per gender, names represent over half of the words for males and nearly a quarter for females; nearly a third of the top female words were family-relation terms.</Paragraph> <Paragraph position="5"> When examining cross-gender conversations, the discriminative words were substantially different. We can quantify the degree of change by measuring KL_SG(w) - KL_CG(w), where KL_SG(w) is the KL measure of word w for same-gender conversations and KL_CG(w) the corresponding measure for cross-gender conversations. The analysis reveals that swear terms are highly associated with male-only conversations, while family-relation words are highly associated with female-only conversations.</Paragraph>
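Equation (2) itself is not reproduced in this section, so the sketch below assumes one common form of the criterion: word w contributes P(w|g) log(P(w|g)/P(w)) to the divergence between a gender-conditional unigram distribution and the pooled distribution, and "dropping the summation over both genders" leaves a per-gender score. The exact formula in the paper may differ; treat this as an illustration of the ranking idea only.

```python
# Hedged sketch of per-word, per-gender KL scoring (assumed form of the
# criterion; the paper's exact equation (2) may differ).
import math
from collections import Counter

def kl_scores(counts_by_gender):
    """counts_by_gender: e.g. {'F': Counter(words), 'M': Counter(words)}.
    Returns per-gender scores; a high scores['M'][w] means w is strongly
    indicative of male conversation sides."""
    pooled = Counter()
    for counts in counts_by_gender.values():
        pooled.update(counts)
    n_pooled = sum(pooled.values())
    scores = {}
    for gender, counts in counts_by_gender.items():
        n_gender = sum(counts.values())
        # Per-word contribution of the gender-conditional distribution's
        # divergence from the pooled distribution.
        scores[gender] = {
            w: (c / n_gender) * math.log((c / n_gender) / (pooled[w] / n_pooled))
            for w, c in counts.items()
        }
    return scores
```

The same-gender vs. cross-gender comparison in the text then amounts to computing these scores separately on the two subcorpora and ranking words by the difference KL_SG(w) - KL_CG(w).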
<Paragraph position="6"> From the traditional sociolinguistic perspective, these methods offer a way of discovering, rather than merely testing, words or phrases that have distinct usage between genders. For example, in a recent paper (Kiesling, in press) the word dude is analyzed as a male-to-male indicator. In our work, the word dude emerged as a male feature. As another example, our observation that some acknowledgments and backchannels (uh-huh) are more common for females than males, while the reverse is true for filled pauses, supports a popular theory in sociolinguistics that males assume a more dominant role than females in conversations (Coates, 1997). Males tend to hold the floor more than females (more filled pauses), and females tend to be more responsive (more acknowledgments/backchannels).</Paragraph> </Section> <Section position="4" start_page="439" end_page="440" type="sub_section"> <SectionTitle> 4.4 Are gender-discriminative features content-bearing words? </SectionTitle> <Paragraph position="0"> Do the most gender-discriminative words contribute to the topic of the conversation, or are they simply filler words with no content? Since each conversation is labeled with one of 40 possible topics, we can rank features with IG or KL using topics instead of genders as categories. In fact, this is the standard way of performing feature selection for text classification. We can then compare the performance of classifying conversations into topics using the top-N features according to the gender or topic ranking. The results are shown in Table 7.</Paragraph> <Paragraph position="1"> Table 7: Topic classification accuracy using the top-N gender-discriminative words, sorted using the information gain criterion. When randomly selecting 5000 features, 10 independent runs were performed and the numbers reported are mean and standard deviation. Using the bottom 5000 topic words resulted in</Paragraph> <Paragraph position="2"> From Table 7 we can observe that gender-discriminative words are neither the most relevant nor the most irrelevant features for topic classification. They are slightly more topic-relevant than topic-irrelevant, but not by a significant margin. The bottom 5000 features for gender discrimination are more strongly topic-irrelevant words.</Paragraph> <Paragraph position="3"> These results show that gender linguistic differences are not merely isolated in a set of words that would function as markers of gender identity, but are rather closely intertwined with semantics. We attempted to improve topic classification by training gender-dependent topic models, but we did not observe any gains.</Paragraph> </Section> <Section position="5" start_page="440" end_page="440" type="sub_section"> <SectionTitle> 4.5 Can gender lexical differences be exploited to improve automatic speech recognition? </SectionTitle> <Paragraph position="0"> Are the observed gender linguistic differences valuable from an engineering perspective as well? In other words, can a natural language processing task benefit from modeling these differences? In this subsection, we train gender-dependent language models and compare their perplexities with standard baselines. An advantage of using gender information for automatic speech recognition is that gender can be robustly detected from acoustic features. In Tables 8 and 9 the perplexities of different gender-dependent language models are shown. The SRILM toolkit (Stolcke, 2002) was used for training the language models with Kneser-Ney smoothing (Kneser and Ney, 1995). The reported perplexities include the end-of-turn marker as a separate token. 2300 conversation sides are used for training each of the {FF, FM, MF, MM} models of Table 8, while 7670 conversation sides are used for training each of the {F, M} models of Table 9. In both tables, the same 1678 sides are used for testing.</Paragraph>
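To make the evaluation concrete, the toy sketch below trains a bigram model and computes perplexity with the end-of-turn marker counted as a token. The paper used SRILM with Kneser-Ney smoothing; the simple interpolated bigram/unigram estimate here is an assumption chosen for brevity, not the authors' configuration.

```python
# Toy bigram language model with unigram interpolation (NOT Kneser-Ney;
# the paper used SRILM). End-of-turn is scored as a separate token.
import math
from collections import Counter

def train_bigram_lm(turns, lam=0.7):
    """turns: list of strings, one per speaker turn."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for turn in turns:
        prev = "<s>"
        for tok in turn.split() + ["</s>"]:  # "</s>" marks end of turn
            unigrams[tok] += 1
            bigrams[(prev, tok)] += 1
            contexts[prev] += 1
            prev = tok
    n = sum(unigrams.values())
    vocab = len(unigrams) + 1  # +1 reserves mass for unseen words

    def prob(prev, tok):
        p_uni = (unigrams[tok] + 1) / (n + vocab)  # add-one unigram
        p_bi = bigrams[(prev, tok)] / contexts[prev] if contexts[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return prob

def perplexity(prob, turns):
    log_sum, count = 0.0, 0
    for turn in turns:
        prev = "<s>"
        for tok in turn.split() + ["</s>"]:
            log_sum += math.log(prob(prev, tok))
            count += 1
            prev = tok
    return math.exp(-log_sum / count)
```

Training this separately on, for example, the F and M sides and evaluating each model on each test set reproduces the matched/mismatched grid of Tables 8 and 9 in miniature.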
<Paragraph position="1"> Table 8: Perplexity of gender-dependent bigram language models. Four gender categories are used. Each column gives the perplexities for a given test set, each row for a given training set.</Paragraph> <Paragraph position="2"> Table 9: Perplexity of gender-dependent bigram language models. Two gender categories are used. Each column gives the perplexities for a given test set, each row for a given training set.</Paragraph> <Paragraph position="3"> In Tables 8 and 9 we observe that we obtain lower perplexities under matched than under mismatched training and testing conditions. This is another way to show that the different subsets of the data do exhibit different properties.</Paragraph> <Paragraph position="4"> However, the best results are obtained by pooling all the data and training a single language model. Therefore, despite the fact that there are different modes in the data, the benefit of more training data outweighs the benefit of gender-dependent models. Interpolating ALL with F and ALL with M resulted in insignificant improvements (81.6 for F and 89.3 for M).</Paragraph> </Section> </Section> </Paper>