<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1068"> <Title>Filtering Speaker-Specific Words from Electronic Discussions</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Filtering Mechanism </SectionTitle> <Paragraph position="0"> Our filtering mechanism identifies and removes idiosyncratic words used by dominant speakers. Such words typically have a high frequency in the postings authored by these speakers. Even though these words can appear anywhere in a person's posting, they appear mostly in signatures (about 75% of these words appear towards the end of a person's posting, while the remaining 25% are distributed throughout the posting). We therefore refer to them throughout this paper as signature words.</Paragraph> <Paragraph position="1"> The filtering mechanism operates in two stages: (1) profile building, and (2) signature-word removal. Profile building. First, our system builds a 'profile', or distribution of word posting frequencies, for each person posting to a newsgroup. The posting frequency of a word is the number of postings where the word is used. For example, a person might have two postings in one newsgroup discussion, and three postings in another, in which case the maximum possible posting frequency for each word used by this person is five. Alternatively, one could count all occurrences of a word in a posting, which could be useful for constructing more detailed stylistic profiles. However, at present we are mainly concerned with words that appear across postings.</Paragraph> <Paragraph position="2"> Signature-word removal. In the second stage, word-usage proportions are calculated for each person. These are the word posting frequencies divided by the person's total number of postings. The aim of this calculation is to filter out words that have a very high proportion. In addition, we wanted to distinguish between the profile of a dominant individual and that of a non-dominant one. 
Hence, rather than just using a simple cut-off threshold for word-usage proportions, we base the decision to filter on the number of postings made by an individual as well as on the proportions. This is done by utilising a statistical significance test (a Bernoulli test) that measures whether a proportion is significantly higher than a threshold (0.4), where significance is based on the number of postings. Although this threshold seems to pick out the signature words, we have found that the filtering mechanism is not very sensitive to this parameter. That is, its actual value is not important so long as it is sufficiently high.</Paragraph> <Paragraph position="3"> The impact of this filtering mechanism on the various newsgroups is shown in the last column of Table 1, which displays the average number of times the filter is applied per discussion thread. This number gives an indication of the existence of signature words from dominant speakers.</Paragraph> <Paragraph position="4"> For example, although the 'hp' newsgroup has a very dominant individual (who accounts for 17.5% of the postings), the filter is applied to this person's postings a very small number of times, as s/he does not have signature words. In contrast, the 'tex' and 'photoshop' newsgroups have less dominant individuals, but here the filter is applied more frequently, as these individuals do have signatures.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Clustering Procedure </SectionTitle> <Paragraph position="0"> The clustering algorithm we have chosen is the K-Means algorithm, because it is one of the simplest, fastest, and most popular clustering algorithms. Further, at this stage our focus is on investigating the effect of the filtering mechanism, rather than on finding the best clustering algorithm for the task at hand. 
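Before turning to K-Means, the Bernoulli-style filtering test of Section 2 can be sketched as follows. This is a minimal illustration only: the paper does not give the exact formulation, so the function name and the one-sided normal approximation to the binomial are our assumptions.

```python
import math

def is_signature_word(posting_freq, total_postings, threshold=0.4, alpha=0.05):
    """Decide whether a word's usage proportion is significantly above the
    threshold, given how many postings the person has made.

    Hypothetical sketch of the Bernoulli test: a one-sided z-test using the
    normal approximation to the binomial under the null p = threshold.
    """
    proportion = posting_freq / total_postings
    # Standard error of a proportion under the null hypothesis p = threshold
    se = math.sqrt(threshold * (1 - threshold) / total_postings)
    z = (proportion - threshold) / se
    # One-sided p-value from the standard normal CDF
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return p_value < alpha
```

This captures the behaviour described in the text: a word used in 3 of 5 postings is not flagged (too few postings for significance), whereas a word used in 18 of 20 postings by a dominant speaker is.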
K-Means places K centers, or centroids, in the input space, and assigns each data point to one of these centers, such that the total Euclidean distance between the points and the centers is minimised.</Paragraph> <Paragraph position="1"> Recall from Section 1 that our evaluative approach consists of merging discussion threads from multiple newsgroups into a single dataset, applying the clustering algorithm to this dataset, and then evaluating the resulting clusters using the known newsgroup memberships. Before describing how clusters created by K-Means are matched to newsgroups (Section 3.2), we describe the data representation used to form the input to K-Means.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Data Representation </SectionTitle> <Paragraph position="0"> As indicated in Section 1, we are interested in clustering complete newsgroup discussions rather than individual postings. Hence, we extract discussion threads from the newsgroups as units of representation. Each thread constitutes a document, which consists of a person's inquiry to a newsgroup and all the responses to the inquiry.</Paragraph> <Paragraph position="1"> Our data representation is a bag-of-words with TF.IDF scoring (Salton and McGill, 1983). Each document (thread) yields one data point, which is represented by a vector. The components of the vector correspond to the words chosen to represent a newsgroup. The values of these components are the normalised TF.IDF scores of these words.</Paragraph> <Paragraph position="2"> The words chosen to represent a newsgroup are all the words that appear in the newsgroup, except function words, very frequent words (whose frequency is greater than the 95th percentile of the newsgroup's word frequencies), and very infrequent words (which appeared less than 20 times throughout the newsgroup). This yields vectors whose typical dimensionality (i.e. the number of words retained) is between 1000 and 2000. 
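The vocabulary selection just described could be sketched as below. This is a hypothetical reading: the function name, the nearest-rank percentile estimate, and the treatment of ties at the cutoff are our assumptions, not the authors' implementation.

```python
def newsgroup_vocabulary(word_counts, function_words, min_count=20, percentile=95):
    """Select the words used to represent a newsgroup.

    word_counts: dict mapping each word in the newsgroup to its frequency.
    Drops function words, very frequent words (above the 95th percentile of
    the newsgroup's word frequencies) and words seen fewer than 20 times.
    """
    freqs = sorted(word_counts.values())
    # Frequency at the given percentile (simple nearest-rank estimate)
    cutoff = freqs[min(len(freqs) - 1, int(len(freqs) * percentile / 100))]
    return {w for w, f in word_counts.items()
            if w not in function_words and min_count <= f <= cutoff}
```

For example, given counts `{"the": 100, "tex": 50, "latex": 30, "rare": 5}` and the function-word list `{"the"}`, only `tex` and `latex` survive: `the` is a function word and `rare` falls below the 20-occurrence floor.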
Since dimensionality reduction is not detrimental to retrieval performance (Schütze and Pedersen, 1995) and speeds up the clustering process, we use Principal Components Analysis (Afifi and Clark, 1996) to reduce the dimensionality of our dataset. This process yields vectors of size 200.</Paragraph> <Paragraph position="3"> The TF.IDF method is used to calculate the score of each word. This method rewards words that appear frequently in a document (term frequency - TF), and penalises words that appear in many documents (inverse document frequency - IDF). There are several ways to calculate TF.IDF (Salton and McGill, 1983). In our experiments it is calculated as TF(w,d) = log(f(w,d) + 1) and IDF(w) = log(N / n(w)), where f(w,d) is the frequency of word w in document d, n(w) is the number of documents where word w appears, and N is the total number of documents in the dataset. In order to reduce the effect of document length, the TF.IDF score of a word in a document is then normalised by taking into account the scores of the other words in the document.</Paragraph> <Paragraph position="4"> One might expect that the IDF component should be able to reduce the influence of signature words of dominant individuals in a newsgroup. However, IDF alone cannot distinguish between words that are representative of a newsgroup and signature words of frequent contributors, i.e. it would discount these equally. Further, we have observed that an individual does not have to post to many threads (documents) for his/her signature words to influence the clustering process. 
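The TF.IDF scoring described above can be sketched as follows. The TF and IDF formulas follow the text (TF = log(f + 1), IDF = log(N / n)); normalising each vector to unit length is one plausible reading of "taking into account the scores of the other words in the document", and is our assumption.

```python
import math

def tfidf_vectors(threads):
    """Compute length-normalised TF.IDF vectors for a list of threads,
    each given as a list of tokens."""
    N = len(threads)
    doc_freq = {}                         # n(w): number of documents containing w
    for tokens in threads:
        for w in set(tokens):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    vectors = []
    for tokens in threads:
        counts = {}
        for w in tokens:                  # f(w, d): frequency of w in d
            counts[w] = counts.get(w, 0) + 1
        vec = {w: math.log(f + 1) * math.log(N / doc_freq[w])
               for w, f in counts.items()}
        # Normalise to unit length to reduce the effect of document length
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors
```

Note that a word appearing in every thread gets IDF = log(1) = 0, which is why IDF alone cannot discount signature words confined to the subset of threads where a dominant individual posts.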
Since IDF discounts words that occur in many documents, it would fail to discount signature words that appear mainly in the subset of documents where such individuals have postings.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Clustering and Identification </SectionTitle> <Paragraph position="0"> In order to evaluate the clusters produced by K-Means for a particular dataset, we compare each document's cluster assignment to its true 'label' - a value that identifies the newsgroup to which the document belongs, of which there are G (three in the dataset considered here). However, because K-Means is an unsupervised mechanism, we do not know which cluster to compare with which newsgroup. We resolve this issue as follows.</Paragraph> <Paragraph position="1"> We calculate the goodness of the match between each cluster i (i = 1, ..., K) and each newsgroup j (j = 1, ..., G) using the F-score from Information Retrieval (Salton and McGill, 1983).</Paragraph> <Paragraph position="3"> F(i,j) = 2 P(i,j) R(i,j) / [P(i,j) + R(i,j)]. This gives an overall measure of how well the cluster represents the newsgroup, taking into account the 'correctness' of the cluster (precision) and how much of the newsgroup it accounts for (recall). Precision is calculated as P(i,j) = n(i,j) / n(i) and recall as R(i,j) = n(i,j) / n(j), where n(i,j) is the number of documents in cluster i that belong to newsgroup j, n(i) is the number of documents in cluster i, and n(j) is the number of documents in newsgroup j.</Paragraph> <Paragraph position="5"> Once all the F(i,j) have been calculated, we choose for each cluster the best newsgroup assignment, i.e. the one with the highest F-score. As a result of this process, multiple clusters may be assigned to the same newsgroup, in which case they are pooled into a single cluster. 
The F-score is then re-calculated for each pooled cluster to give an overall performance measure for these clusters.</Paragraph> <Paragraph position="7"> The clustering procedure is evaluated using two main measures: (1) the number of newsgroups that were matched by the generated clusters (between 1 and G), and (2) the F-score of the pooled clusters.</Paragraph> <Paragraph position="8"> The first measure estimates how many clusters are needed to find all the newsgroups, while the second measure assesses the quality of these clusters.</Paragraph> <Paragraph position="9"> Further, the number of clusters that are needed to achieve an acceptable quality of performance suggests the level of granularity needed to separate the newsgroups (few clusters correspond to a coarse level of granularity, many clusters to a fine one).</Paragraph> <Paragraph position="10"> The clustering procedure is also evaluated as a whole by calculating its overall precision, i.e. the proportion of documents that were assigned correctly over the whole dataset. Note that the overall recall is the same as the overall precision, since the denominators in both measures consist of all the documents in the dataset. Hence, the F-score is equal to the precision.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Example </SectionTitle> <Paragraph position="0"> We now show a sample output of the clustering procedure described above, with and without the filtering mechanism described in Section 2. Tables 2 and 3 display the pooled clusters created without and with filtering, respectively. These tables show how many clusters were found for each newsgroup, the number of documents in each pooled cluster, and the performance of the cluster (P, R and F). The tables also present the top 30 representative words in each cluster (restricted to 30 due to space limitations). 
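Before the example, the matching-and-pooling procedure of Section 3.2 can be sketched as below. This is a hypothetical illustration: the function and variable names are ours, and only the F-score matching, pooling, and re-scoring steps described in the text are implemented.

```python
from collections import defaultdict

def f_score(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def match_and_pool(assignments, labels):
    """Match clusters to newsgroups by highest F-score, then pool clusters
    assigned to the same newsgroup and re-calculate the pooled F-scores.

    assignments[i] is the cluster of document i, labels[i] its newsgroup.
    """
    n_ij = defaultdict(int)   # documents in cluster i labelled j
    n_i = defaultdict(int)    # cluster sizes
    n_j = defaultdict(int)    # newsgroup sizes
    for c, g in zip(assignments, labels):
        n_ij[(c, g)] += 1
        n_i[c] += 1
        n_j[g] += 1
    # Best newsgroup for each cluster: argmax over F(i, j)
    best = {c: max(n_j, key=lambda g: f_score(n_ij[(c, g)] / n_i[c],
                                              n_ij[(c, g)] / n_j[g]))
            for c in n_i}
    # Pool clusters mapped to the same newsgroup and re-score the pools
    correct, size = defaultdict(int), defaultdict(int)
    for c, g in best.items():
        correct[g] += n_ij[(c, g)]
        size[g] += n_i[c]
    pooled_f = {g: f_score(correct[g] / size[g], correct[g] / n_j[g])
                for g in correct}
    return best, pooled_f
```

For instance, with six documents in three clusters and two newsgroups, a pure cluster matches its newsgroup directly, while a mixed cluster is assigned wherever its F-score is highest and then pooled with any other cluster given the same assignment.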
These words are sorted in decreasing order of their average TF.IDF score over the documents in the cluster (words representative of a cluster should have high TF.IDF scores, because they appear frequently in the documents in the cluster, and infrequently in the documents in other clusters).</Paragraph> <Paragraph position="1"> According to the results in Table 2, the top-30 list for the 'hp' cluster does not have many signature words. This was anticipated by the observation that the filtering mechanism was applied very rarely to the 'hp' newsgroup (Table 1). In contrast, the majority of the top-30 words in the 'tex' cluster are signature words (some exceptions are 'chapter', 'english' and 'examples'). We conclude that this pooled cluster was created (using two different clusters) to represent the various signatures in the 'tex' newsgroup.</Paragraph> <Paragraph position="2"> Further, a relatively small number of documents are assigned to the 'tex' cluster, which therefore has a very low recall value (0.34). Its precision is perfect, but its low recall suggests that many of the documents representing the true topics of this newsgroup were assigned to other clusters.</Paragraph> <Paragraph position="3"> The 'photoshop' cluster has a very high precision and recall, so most of the 'photoshop' documents were assigned correctly. However, here too many of the top words are signature words. Even when the 'obvious' signature words are ignored (such as URLs and people's names), there are still words that confuse the topics of this newsgroup, such as 'million', 'america', 'urban' and 'dragon'.</Paragraph> <Paragraph position="4"> In Table 3 most of the words discovered by the clustering procedure represent the true topics of the newsgroups. The filtering mechanism removes the dominant signature words, and thus the clustering procedure is able to find the true topic-related clusters (precision and recall are very high for all pooled clusters). 
Notice that there are still some signature-related words, such as 'arseneau' and 'fairbairns' in the 'tex' cluster, and 'tacit' and 'gifford' in the 'photoshop' cluster. These words correspond mainly to a dominant individual's name or email address, and the filtering mechanism fails to filter them when other individuals reply to the dominant individual using these words. In a thread (document) containing a dominant individual, that individual's signature words are filtered, but unless the people replying to the dominant individual are dominant themselves, the words they use to refer to this individual will not be filtered, and therefore will influence the clustering process. This highlights further the problem that our filtering mechanism is addressing, and suggests that more filtering should be done.</Paragraph> </Section> </Section> </Paper>