<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1054">
  <Title>A Quantitative Analysis of Lexical Differences Between Genders in Telephone Conversations</Title>
  <Section position="4" start_page="435" end_page="435" type="metho">
    <SectionTitle>
2 The Corpus and Data Preparation
</SectionTitle>
    <Paragraph position="0"> The Fisher corpus (Cieri et al., 2004) was used in all our experiments. It consists of telephone conversations between two people, randomly assigned to speak to each other. At the beginning of each conversation a topic is suggested at random from a list of 40. The latest release of the Fisher collection has more than 16 000 telephone conversations averaging 10 minutes each. Each person participates in 1-3 conversations, and each conversation is annotated with a topicality label. The topicality label gives the degree to which the suggested topic was followed and is an integer from 0 to 4, 0 being the worst. At our site, we had an earlier version of the Fisher corpus with around 12 000 conversations. After removing conversations where at least one of the speakers was non-native (about 10% of speakers are non-native, making this corpus suitable for investigating their lexical differences compared to American English speakers) and conversations with topicality 0 or 1, we were left with 10 127 conversations. The original transcripts were minimally processed: acronyms were normalized to a sequence of characters with no intervening spaces, e.g. t. v. to tv; word fragments were converted to the same token, wordfragment; all words were lowercased; and punctuation marks and special characters were removed. Some non-lexical tokens were retained, such as laughter and filled pauses (uh, um). Backchannels and acknowledgments such as uh-huh and mm-hmm were also kept. The gender distribution of the Fisher corpus is 53% female and 47% male. The age distribution is 38% aged 16-29, 45% aged 30-49, and 17% aged 50+. Speakers were connected at random</Paragraph>
    <Paragraph position="1"> from a pool recruited in a national ad campaign. It is unlikely that the speakers knew their conversation partner. All major American English dialects are well represented; see (Cieri et al., 2004) for more details. The Fisher corpus was primarily created to facilitate automatic speech recognition research. The subset we have used has about 17.8M words, or about 1 600 hours of speech, and is the largest resource ever used to analyze gender linguistic differences.</Paragraph>
    <Paragraph position="2"> In comparison, Singh (2001) used about 30 000 words for a similar analysis.</Paragraph>
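A minimal sketch of the transcript normalization described above; the paper specifies the steps only by example, so the patterns below (in particular, how word fragments are marked in the raw transcripts) are illustrative assumptions:

```python
import re

def normalize_transcript(text):
    """Illustrative normalization: acronym collapsing, fragment mapping,
    lowercasing, and punctuation stripping. Exact rules are assumptions."""
    # Acronyms spelled out as "t. v." become a single token, e.g. "tv"
    text = re.sub(r"\b((?:[a-zA-Z]\.\s*){2,})",
                  lambda m: m.group(1).replace(".", "").replace(" ", "") + " ",
                  text)
    # Word fragments (assumed here to be marked with a trailing hyphen)
    # are all mapped to the same token "wordfragment"
    text = re.sub(r"\b\w+-(?=\s|$)", "wordfragment", text)
    # Lowercase everything
    text = text.lower()
    # Remove punctuation and special characters, but keep word-internal
    # hyphens so backchannels like "uh-huh" and "mm-hmm" survive
    text = re.sub(r"[^\w\s-]", " ", text)
    return " ".join(text.split())
```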
    <Paragraph position="3"> Before attempting to analyze the gender differences, there are two main biases that need to be removed. The first bias, which we term the topic bias, is introduced by not accounting for the fact that the distribution of topics across males and females is uneven, despite the fact that the topic is pre-assigned randomly. For example, if topic A happened to be more common for males than females and we failed to account for that, then we would be implicitly building a topic classifier rather than a gender classifier. Our intention here is to analyze gender linguistic differences controlling for the topic effect, as if both genders talked equally about the same topics. The second bias, which we term the speaker bias, is introduced by not accounting for the fact that specific speakers have idiosyncratic expressions. If our training data consisted of a small number of speakers appearing in both training and testing data, then we would be implicitly modeling speaker differences rather than gender differences.</Paragraph>
    <Paragraph position="4"> To normalize for these two important biases, we made sure that both genders have the same percentage of conversation sides for each topic, and that there are 8899 speakers in training and 2000 in testing with no overlap between the two sets. After these two steps, there were 14969 conversation sides used for training and 3738 sides for testing. The median length of a conversation side was 954 words.</Paragraph>
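The two debiasing steps can be sketched as follows, assuming each conversation side carries speaker, topic and gender labels; the data layout and function names are illustrative, not the authors':

```python
import random
from collections import defaultdict

def balance_and_split(sides, test_speakers_n=2000, seed=0):
    """Sketch of the two debiasing steps: per-topic gender balancing,
    then a speaker-disjoint train/test split. All names are assumptions."""
    rng = random.Random(seed)
    # 1) Topic bias: for each topic, keep equal numbers of male and
    #    female conversation sides by downsampling the larger gender.
    by_topic = defaultdict(lambda: {"male": [], "female": []})
    for s in sides:
        by_topic[s["topic"]][s["gender"]].append(s)
    balanced = []
    for genders in by_topic.values():
        n = min(len(genders["male"]), len(genders["female"]))
        balanced += rng.sample(genders["male"], n) + rng.sample(genders["female"], n)
    # 2) Speaker bias: split by speaker so no speaker appears in both sets.
    speakers = sorted({s["speaker"] for s in balanced})
    rng.shuffle(speakers)
    test_set = set(speakers[:test_speakers_n])
    train = [s for s in balanced if s["speaker"] not in test_set]
    test = [s for s in balanced if s["speaker"] in test_set]
    return train, test
```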
  </Section>
  <Section position="5" start_page="435" end_page="436" type="metho">
    <SectionTitle>
3 Machine Learning Methods Used
</SectionTitle>
    <Paragraph position="0"> The methods we have used for characterizing the differences between genders and gender pairs are similar to those used for the task of text classification. In text classification, the objective is to classify a document vector d to one (or more) of T pre-defined topics y. A set of N tuples (d_n, y_n) is provided for training the classifier. A major challenge of text classification is the very high dimensionality of the document representation, which motivates feature selection, i.e. selecting the most discriminative words and discarding all others.</Paragraph>
    <Paragraph position="1"> In this study, we chose two ways for characterizing the differences between gender categories. The first, is to classify the transcript of each speaker, i.e. each conversation side, to the appropriate gender category. This approach can show the cumulative effect of all terms on the distinctiveness of gender categories. The second approach is to apply feature selection methods, similar to those used in text categorization, to reveal the most characteristic features for each gender.</Paragraph>
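Casting gender identification as text classification raises the question of how terms should be weighted in the document vectors. A stdlib-only sketch contrasting raw term counts with idf weighting (all names are assumptions): note that under idf, a word occurring in every document receives weight zero, which is precisely the deweighting of common terms at issue here.

```python
import math
from collections import Counter

def vectorize(docs, use_idf=True):
    """Build sparse term vectors from whitespace-tokenized documents,
    either as raw counts or tf-idf weighted. Illustrative sketch only."""
    n = len(docs)
    df = Counter()          # document frequency of each term
    tfs = []
    for doc in docs:
        tf = Counter(doc.split())
        tfs.append(tf)
        df.update(tf.keys())
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = []
    for tf in tfs:
        if use_idf:
            # Terms present in every document get idf = log(n/n) = 0
            vecs.append({w: c * idf[w] for w, c in tf.items()})
        else:
            vecs.append(dict(tf))
    return vecs
```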
    <Paragraph position="2"> Classifying a transcript of speech according to gender can be done with a number of different learning methods. We have compared Support Vector Machines (SVMs), Naive Bayes, Maximum Entropy and the tfidf/Rocchio classifier, and found SVMs to be the most successful. A possible difference between text classification and gender classification is that different methods of feature weighting may be appropriate. In text classification, inverse document frequency is applied to the frequency of each term, resulting in the deweighting of common terms. This weighting scheme is effective for text classification because common terms do not contribute to the topic of a document. However, the reverse may be true for gender classification, where the common terms may be the ones that contribute most to the gender category. This issue, which we investigate in section 4, has implications for the feature weighting scheme that needs to be applied to the vector representation. In addition to classification, we have applied feature selection techniques to assess the discriminative ability of each individual feature. Information gain has been shown to be one of the most successful feature selection methods for text classification (Forman, 2003). It is given by:</Paragraph>
    <Paragraph position="4"> IG(w) = H(C) - P(w)H(C|w) - P(\bar{w})H(C|\bar{w}) (1), where H(C) is the entropy of the discrete gender category random variable C. Each document is represented with the Bernoulli model, i.e. a vector of 1s and 0s depending on whether or not each word appears in the document. We have also implemented another feature selection mechanism, the KL-divergence, which is given by:</Paragraph>
    <Paragraph position="6"> KL(c_1 || c_2) = \sum_w p(w|c_1) \log \frac{p(w|c_1)}{p(w|c_2)} (2). For the KL-divergence we have used the multinomial model, i.e. each document is represented as a vector of word counts. We smoothed the p(w|c) distributions by assuming that every word in the vocabulary is observed at least 5 times for each class.</Paragraph>
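A minimal sketch of KL-divergence feature scoring under the multinomial model, with the smoothing just described (a floor of 5 occurrences per class); the function name, data layout, and the per-word decomposition are assumptions:

```python
import math
from collections import Counter

def kl_scores(docs_male, docs_female, floor=5):
    """Score each vocabulary word by its contribution to the KL-divergence
    between the two class-conditional word distributions. Illustrative
    sketch; names and exact formulation are assumptions."""
    cm = Counter(w for d in docs_male for w in d.split())
    cf = Counter(w for d in docs_female for w in d.split())
    vocab = set(cm) | set(cf)
    # Smoothing: assume every vocabulary word occurs at least `floor`
    # times in each class
    cm = {w: max(cm[w], floor) for w in vocab}
    cf = {w: max(cf[w], floor) for w in vocab}
    tm, tf = sum(cm.values()), sum(cf.values())
    pm = {w: cm[w] / tm for w in vocab}
    pf = {w: cf[w] / tf for w in vocab}
    # Per-word contribution to KL(p(w|male) || p(w|female)); large positive
    # values mark words much more typical of the first class.
    return {w: pm[w] * math.log(pm[w] / pf[w]) for w in vocab}
```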
  </Section>
</Paper>