Filtering Speaker-Specific Words from Electronic Discussions
Ingrid Zukerman and Yuval Marom
School of Computer Science and Software Engineering
Monash University, Clayton, Victoria 3800, AUSTRALIA
a0 ingrid, yuvalm
a1 @csse.monash.edu.au
Abstract
The work presented in this paper is the first
step in a project which aims to cluster and sum-
marise electronic discussions in the context of
help-desk applications. The eventual objective
of this project is to use these summaries to as-
sist help-desk users and operators. In this pa-
per, we identify features of electronic discus-
sions that influence the clustering process, and
offer a filtering mechanism that removes unde-
sirable influences. We tested the clustering and
filtering processes on electronic newsgroup dis-
cussions, and evaluated their performance by
means of two experiments: coarse-level cluster-
ing and simple information retrieval. Our eval-
uation shows that our filtering mechanism has a
significant positive effect on both tasks.
1 Introduction
The ability to draw on past experience is often use-
ful in information-providing applications. For in-
stance, users who interact with help-desk applica-
tions would benefit from the availability of rele-
vant contextual information about their request, e.g.,
from previous, similar interactions between the sys-
tem and other users, or from interactions between
domain experts.
The work reported in this paper is the first step
in a project which aims to provide such informa-
tion. The eventual objective of our project is to au-
tomatically identify related interactions in help-desk
applications, and to generate summaries from their
combined experience. These summaries would then
assist both users and operators.
Our approach to the identification of related in-
teractions hinges on the application of clustering
techniques. These techniques have been used in
Information Retrieval for some time (e.g. Salton,
1971). They involve grouping a set of related docu-
ments, and then using a representative element to
match input queries (as opposed to matching the
whole collection of documents). Document cluster-
ing has been used in search engine applications to
improve and speed up retrieval (e.g. Zamir and Et-
zioni, 1998), but also for more descriptive purposes,
such as using representative elements of a cluster to
generate lists of keywords (Neto et al., 2000).
However, discussions (and dialogues in general)
have distinguishing features which make cluster-
ing a corpus of such interactions a more challeng-
ing task than clustering plain documents. These
features are: (1) the corpus consists of contribu-
tions made by a community of authors, or “speak-
ers”; (2) certain speakers are more dominant in the
corpus; and (3) speakers often use idiosyncratic,
speaker-specific language, or make comments that
are not about the task at hand.
In this paper, we report on a preliminary study
where we cluster discussions carried out in elec-
tronic newsgroups. Specifically, we report on the
influence of the above features on the clustering pro-
cess, and describe a filtering mechanism that identi-
fies and removes undesirable influences.
Table 1 shows the newsgroups used as data
sources in our experiments. These newsgroups were
obtained from the Internet. The table shows the
number of threads in each newsgroup, the number
of people posting to the newsgroups, and the high-
est number of postings by an individual for each
newsgroup. It also shows the impact of the filter-
ing mechanism on each newsgroup (Section 2).
The clustering process and filtering mechanism
were evaluated by means of two experiments:
(1) coarse-level clustering, and (2) simple informa-
tion retrieval.
Coarse-level clustering. This experiment con-
sists of merging the discussion threads (documents)
in different newsgroups into a single dataset, and
applying a clustering mechanism to separate them.
The performance of the clustering mechanism is
then evaluated by how well the generated clusters
match the original newsgroups from which the dis-
cussion threads were obtained. Clearly, this eval-
uation is at a coarser level of granularity than that
required for our final system. However, we find it
useful for the following reasons:
newsgroup number of number of most frequent average filter
threads people number of postings usage (per thread)
lp.hp 1920 1707 715 (17.5%) 0.5
comp.text.tex 1383 1140 246 (4.6%) 7.4
comp.graphics.apps.photoshop 1637 1586 395 (5.8%) 6.9
Table 1: Description of the newsgroups.
a0 Owing to the number and diversity of news-
groups on the Internet, we can perform con-
trolled experiments where we vary the degree
of similarity between newsgroups, thereby
simulating discussions with different levels of
relatedness.
a0 Our experiments show that our filtering mech-
anism has a positive influence at different lev-
els of granularity (Section 4). Hence, there is
reason to expect that this influence will remain
for finer levels of granularity, e.g., the level of
a task or request.
a0 Finally, the different newsgroups are identified
in advance, which obviates the need for manual
discussion-tagging at this stage.
Due to space limitations, we report only on a sub-
set of our experiments. In (Marom and Zukerman,
2004) we present a comparative study that consid-
ers different sets of newsgroups of varying levels of
relatedness. We regard the set of newsgroups pre-
sented here as having a “medium” level of related-
ness.
Simple information retrieval. This experiment
constitutes a simplistic and restricted version of the
document retrieval functionality envisaged for our
eventual system. In this experiment, we matched
pairs of query terms to the centroids of the generated
clusters, and assessed the system’s ability to retrieve
relevant discussion threads from the best-matching
cluster, with and without filtering. The experiment
makes the implicit assumption that the corpus con-
tains discussions relevant to incoming requests, i.e.
that new requests are similar to old ones. We believe
that the results of this restricted experiment are in-
dicative of future system performance, as the envis-
aged system is also expected to operate under this
assumption.
Next, we describe our filtering mechanism. Sec-
tion 3 describes the clustering procedure, includ-
ing our data representation and cluster identification
method. Section 4 presents the results from our ex-
periments, and Section 5 concludes the paper.
2 The Filtering Mechanism
Our filtering mechanism identifies and removes id-
iosyncratic words used by dominant speakers. Such
words typically have a high frequency in the post-
ings authored by these speakers. Even though these
words can appear anywhere in a person’s posting,
they appear mostly in signatures (about 75% of
these words appear towards the end of a person’s
posting, while the remaining 25% are distributed
throughout the posting). We therefore refer to them
throughout this paper as signature words.
The filtering mechanism operates in two stages:
(1) profile building, and (2) signature-word re-
moval.
Profile building. First, our system builds a ‘pro-
file’, or distribution of word posting frequencies, for
each person posting to a newsgroup. The posting
frequency of a word is the number of postings where
the word is used. For example, a person might have
two postings in one newsgroup discussion, and three
postings in another, in which case the maximum
possible posting frequency for each word used by
this person is five. Alternatively, one could count all
occurrences of a word in a posting, which could be
useful for constructing more detailed stylistic pro-
files. However, at present we are mainly concerned
with words that appear across postings.
Signature-word removal. In the second stage,
word-usage proportions are calculated for each per-
son. These are the word posting frequencies divided
by the person’s total number of postings. The aim
of this calculation is to filter out words that have
a very high proportion. In addition, we wanted to
distinguish between the profile of a dominant in-
dividual and that of a non-dominant one. Hence,
rather than just using a simple cut-off threshold for
word-usage proportions, we base the decision to fil-
ter on the number of postings made by an individual
as well as on the proportions. This is done by util-
ising a statistical significance test (a Bernoulli test)
that measures if a proportion is significantly higher
than a threshold (0.4),1 where significance is based
on the number of postings.
The impact of this filtering mechanism on the var-
ious newsgroups is shown in the last column of Ta-
ble 1, which displays the average number of times
the filter is applied per discussion thread. This num-
ber gives an indication of the existence of signature
1Although this threshold seems to pick out the signature
words, we have found that the filtering mechanism is not very
sensitive to this parameter. That is, its actual value is not im-
portant so long as it is sufficiently high.
words from dominant speakers. For example, al-
though the ‘hp’ newsgroup has a very dominant in-
dividual (who accounts for 17.5% of the postings),
the filter is applied to this person’s postings a very
small number of times, as s/he does not have sig-
nature words. In contrast, the ‘tex’ and ‘photo-
shop’ newsgroups have less dominant individuals,
but here the filter is applied more frequently, as
these individuals do have signatures.
3 The Clustering Procedure
The clustering algorithm we have chosen is the
K-Means algorithm, because it is one of the sim-
plest, fastest, and most popular clustering algo-
rithms. Further, at this stage our focus is on in-
vestigating the effect of the filtering mechanism,
rather than on finding the best clustering algorithm
for the task at hand. K-Means places a0 centers, or
centroids, in the input space, and assigns each data
point to one of these centers, such that the total Eu-
clidean distance between the points and the centers
is minimised.
Recall from Section 1 that our evaluative ap-
proach consists of merging discussion threads from
multiple newsgroups into a single dataset, apply-
ing the clustering algorithm to this dataset, and then
evaluating the resulting clusters using the known
newsgroup memberships. Before describing how
clusters created by K-Means are matched to news-
groups (Section 3.2), we describe the data represen-
tation used to form the input to K-Means.
3.1 Data Representation
As indicated in Section 1, we are interested in clus-
tering complete newsgroup discussions rather than
individual postings. Hence, we extract discussion
threads from the newsgroups as units of represen-
tation. Each thread constitutes a document, which
consists of a person’s inquiry to a newsgroup and
all the responses to the inquiry.
Our data representation is a bag-of-words with
TF.IDF scoring (Salton and McGill, 1983). Each
document (thread) yields one data point, which is
represented by a vector. The components of the vec-
tor correspond to the words chosen to represent a
newsgroup. The values of these components are the
normalised TF.IDF scores of these words.
The words chosen to represent a newsgroup are
all the words that appear in the newsgroup, except
function words, very frequent words (whose fre-
quency is greater than the 95th percentile of the
newsgroup’s word frequencies), and very infrequent
words (which appeared less than 20 times through-
out the newsgroup). This yields vectors whose typ-
ical dimensionality (i.e. the number of words re-
tained) is between 1000 and 2000. Since dimension-
ality reduction is not detrimental to retrieval per-
formance (Sch¨utze and Pedersen, 1995) and speeds
up the clustering process, we use Principal Compo-
nents Analysis (Afifi and Clark, 1996) to reduce the
dimensionality of our dataset. This process yields
vectors of size 200.
The TF.IDF method is used to calculate the score
of each word. This method rewards words that ap-
pear frequently in a document (term frequency –
TF), and penalises words that appear in many docu-
ments (inverse document frequency – IDF). There
are several ways to calculate TF.IDF (Salton and
McGill, 1983). In our experiments it is calculated
as TFa1a3a2a5a4a7a6a9a8a11a10a13a12a15a14a17a16a18a1a3a2a20a19a22a21a24a23 and IDFa1a25a4a26a6a9a8a11a10a27a12a11a14a29a28a31a30a33a32a34a1a35a23 ,
where a16 a1a3a2 is the frequency of word a36 in document
a37 ,
a32 a1 is the number of documents where word a36 ap-
pears, and a28 is the total number of documents in the
dataset. In order to reduce the effect of document
length, the TF.IDF score of a word in a document is
then normalised by taking into account the scores of
the other words in the document.
One might expect that the IDF component should
be able to reduce the influence of signature words
of dominant individuals in a newsgroup. However,
IDF alone cannot distinguish between words that are
representative of a newsgroup and signature words
of frequent contributors, i.e. it would discount these
equally. Further, we have observed that an individ-
ual does not have to post to many threads (docu-
ments) for his/her signature words to influence the
clustering process. Since IDF discounts words that
occur in many documents, it would fail to discount
signature words that appear mainly in the subset of
documents where such individuals have postings.
3.2 Clustering and Identification
In order to evaluate the clusters produced by K-
Means for a particular dataset, we compare each
document’s cluster assignment to its true ‘label’ —
a value that identifies the newsgroup to which the
document belongs, of which there are a38 (three in
the dataset considered here). However, because K-
Means is an unsupervised mechanism, we do not
know which cluster to compare with which news-
group. We resolve this issue as follows.
We calculate the goodness of the match between
each cluster a36a5a39a41a40a13a21a43a42a9a42a0a45a44 and each newsgroup a37 a39
a40a13a21a43a42a9a42a3a38
a44 (a0a47a46
a38 ) using the F-score from Information
Retrieval (Salton and McGill, 1983). This gives an
overall measure of how well the cluster represents
the newsgroup, taking into account the ‘correctness’
of the cluster (precision) and how much of the news-
group it accounts for (recall). Precision is calculated
as
a0
a1a3a2 a4
# documents in cluster a36 and newsgroup a37
# documents in cluster a36
and recall as
a1
a1a3a2 a4
# documents in cluster a36 and newsgroup a37
# documents in newsgroup a37
The F-score is then calculated as
a2
a1a3a2 a4 a40a4a3 a42a6a5 a14
a21
a0
a1a3a2
a19
a21
a1
a1a2
a23
a44a8a7a10a9
Once all the a2 a1a3a2 have been calculated, we choose
for each cluster the best newsgroup assignment, i.e.
the one with the highest F-score. As a result of this
process, multiple clusters may be assigned to the
same newsgroup, in which case they are pooled into
a single cluster. The F-score is then re-calculated for
each pooled cluster to give an overall performance
measure for these clusters.
The clustering procedure is evaluated using two
main measures: (1) the number of newsgroups that
were matched by the generated clusters (between 1
and a38 ), and (2) the F-score of the pooled clusters.
The first measure estimates how many clusters are
needed to find all the newsgroups, while the sec-
ond measure assesses the quality of these clusters.
Further, the number of clusters that are needed to
achieve an acceptable quality of performance sug-
gests the level of granularity needed to separate the
newsgroups (few clusters correspond to a coarse
level of granularity, many clusters to a fine one).
The clustering procedure is also evaluated as a
whole by calculating its overall precision, i.e. the
proportion of documents that were assigned cor-
rectly over the whole dataset. Note that the over-
all recall is the same as the overall precision, since
the denominators in both measures consist of all
the documents in the dataset. Hence, the F-score
is equal to the precision.
3.3 Example
We now show a sample output of the clustering pro-
cedure described above, with and without the filter-
ing mechanism described in Section 2. Tables 2
and 3 display the pooled clusters created without
and with filtering, respectively. These tables show
how many clusters were found for each newsgroup,
the number of documents in each pooled cluster,
and the performance of the cluster (P, R and F). The
tables also present the top 30 representative words
in each cluster (restricted to 30 due to space limita-
tions). These words are sorted in decreasing order
of their average TF.IDF score over the documents in
the cluster (words representative of a cluster should
have high TF.IDF scores, because they appear fre-
quently in the documents in the cluster, and infre-
quently in the documents in other clusters).
According to the results in Table 2, the top-30 list
for the ‘hp’ cluster does not have many signature
words. This was anticipated by the observation that
the filtering mechanism was applied very rarely to
the ‘hp’ newsgroup (Table 1). In contrast, the major-
ity of the top-30 words in the ‘tex’ cluster are signa-
ture words (some exceptions are ‘chapter’, ‘english’
and ‘examples’). We conclude that this pooled clus-
ter was created (using two different clusters) to rep-
resent the various signatures in the ‘tex’ newsgroup.
Further, a relatively small number of documents are
assigned to the ‘tex’ cluster, which therefore has a
very low recall value (0.34). Its precision is perfect,
but its low recall suggests that many of the docu-
ments representing the true topics of this newsgroup
were assigned to other clusters.
The ‘photoshop’ cluster has a very high precision
and recall, so most of the ‘photoshop’ documents
were assigned correctly. However, here too many
of the top words are signature words. Even when
the ‘obvious’ signature words are ignored (such as
URLs and people’s names), there are still words that
confuse the topics of this newsgroup, such as ‘mil-
lion’, ‘america’, ‘urban’ and ‘dragon’.
In Table 3 most of the words discovered by the
clustering procedure represent the true topics of the
newsgroups. The filtering mechanism removes the
dominant signature words, and thus the clustering
procedure is able to find the true topic-related clus-
ters (precision and recall are very high for all pooled
clusters). Notice that there are still some signature-
related words, such as ‘arseneau’ and ‘fairbairns’ in
the ‘tex’ cluster, and ‘tacit’ and ‘gifford’ in the ‘pho-
toshop’ cluster. These words correspond mainly to
a dominant individual’s name or email address, and
the filtering mechanism fails to filter them when
other individuals reply to the dominant individual
using these words. In a thread (document) contain-
ing a dominant individual, that individual’s signa-
ture words are filtered, but unless the people reply-
ing to the dominant individual are dominant them-
selves, the words they use to refer to this individual
will not be filtered, and therefore will influence the
clustering process. This highlights further the prob-
lem that our filtering mechanism is addressing, and
suggests that more filtering should be done.
4 Evaluation
The example presented in the last section pertains to
a specific run of the clustering procedure. We now
evaluate our system more generally by looking at
hp (1 cluster, 1825 documents, P=0.58, R=0.97, F=0.73)
unable, connected, hat, entry, fix, configure, lpd, configuration, parallel, psc, kernel, configured, kurt, de, taylor,
report, local, asnd@triumf.ca, grant, plain, debian, linuxprinting.org, officejet, instructions, letter, appears, update,
called, extra, compile
tex (2 clusters, 375 documents, P=1.00, R=0.34, F=0.50)
luecking, arkansas, http://www.tex.ac.uk. . . , herbert, piet, oostrum, university, heiko, lars, mathemati-
cal, department, voss, van, http://people.ee.eth. . . , sciences, madsen, rtfsignature, http://www.ctan.org/. . . ,
http://www.ams.org/t. . . , wilson, oberdiek, http://www.ctan.org/. . . , apr, examples, english, asnd@triumf.ca, chap-
ter, rf@cl.cam.ac.uk, sincerely, private
photoshop (2 clusters, 1143 documents, P=0.95, R=0.95, F=0.95)
gifford, million, jgifford@surewest.ne. . . , heinlein, www.nitrosyncretic.c. . . , john@stafford.net, america, urban,
dragon, fey, imperial, created, hard, pictures, rgb, edjh, folder, face=3darial, tutorials, professional, comic, graphic,
sketches, http://www.sover.net. . . , move, drive, wdflannery@aol.com, colors, buy, posted
Table 2: Top 30 centroid words found by the clustering procedure without filtering.
hp (2 clusters, 1162 documents, P=0.97, R=0.93, F=0.95)
lprng, connected, linuxprinting.org, kernel, red, psc, hat, configure, unable, configuration, configured, parallel,
ljet, printtool, series, database, jobs, gimp-print, debian, entry, suse, cupsomatic, officejet, cat, perfectly, jetdirect,
duplex, devices, kde, happens
tex (1 cluster, 1040 documents, P=0.98, R=0.91, F=0.95)
arseneau, ctan, fairbairns, style, miktex, pdflatex, faq, chapter, apr, symbols, dvips, figures, title, include, math,
bibtex, kastrup, university, examples, english, dvi, peter, plain, documents, contents, written, e.g, macro, robin,
donald
photoshop (2 clusters, 1287 documents, P=0.88, R=0.98, F=0.93)
tacit, james, gifford, folder, rgb, pictures, created, colors, tutorials, illustrator, window, tom, mask, money, what-
ever, newsgroup, drive, brush, plugin, professional, stafford, view, menu, palette, channel, graphic, pixel, ram,
tutorial, paint
Table 3: Top 30 centroid words found by the clustering procedure with filtering.
clustering performance for a range of values of a0 ,
and inspecting the implications of this performance
with respect to a document retrieval task.
4.1 Coarse-Level Clustering
Figure 1 shows the overall clustering performance
obtained without filtering (solid line) and with fil-
tering (dashed line). The left-hand-side of the figure
shows the average number of newsgroups matched
to clusters, while the right-hand-side shows the
overall performance (F-score) obtained. The error
bars in the plots are averages of 100 repetitions of
the clustering procedure described in Section 3.2
(with random initialisation of the centroids at the
start of each run). The widths of the error bars in-
dicate 95% confidence intervals for these averages.
Hence, non-overlapping intervals correspond to a
difference with p-value lower than 0.05.
In (Marom and Zukerman, 2004), we show that
the effect of the filtering mechanism on clustering
performance depends on three factors: (1) the pres-
ence of signature words from dominant contribu-
tors; (2) the ‘natural’, topical overlap between the
newsgroups; and (3) the level of granularity in the
clustering, i.e. the number of centroids.
The main conclusions with respect to the dataset
presented here are as follows.
a0 Firstly, there is a heavy presence of signature
words in two of the newsgroups (‘tex’ and
‘photoshop’ – see Table 1), and therefore the
filtering mechanism has a significant effect on
this dataset as a whole. As can be seen in Fig-
ure 1, the performance (F-score) without fil-
tering is poorer for all values of a0 , and sub-
stantially more so for low values of a0 . Al-
though the clustering procedure without filter-
ing is able to find three distinct newsgroups
with a0 a4 a5 , it requires a higher value of a0 to
achieve a satisfactory performance. This sug-
gests that the signature words create undesir-
able overlaps between the clusters. In contrast,
when filtering is used, the clustering procedure
reaches its best performance with a0 a4 a1 , where
the performance is extremely good.
a0 Secondly, the fact that the performance with
filtering converges for such a low value of a0
suggests that there is little true topical overlap
between the newsgroups, and the fact that the
performance is significantly better for a0 a4
a1
3 4 5 6 7 8 9 102.3
2.4
2.5
2.6
2.7
2.8
2.9
3
# newsgroups matched
k 3 4 5 6 7 8 9 10
0.5
0.6
0.7
0.8
0.9
F−score
 
k
Figure 1: Overall clustering performance.
4 6 8 100.2
0.4
0.6
0.8
1
precision
hp
4 6 8 100.2
0.4
0.6
0.8
1
tex
4 6 8 100.2
0.4
0.6
0.8
1
photoshop
4 6 8 100.2
0.4
0.6
0.8
1
recall
4 6 8 100.2
0.4
0.6
0.8
1
4 6 8 100.2
0.4
0.6
0.8
1
4 6 8 100.2
0.4
0.6
0.8
1
F−score
k
4 6 8 100.2
0.4
0.6
0.8
1
k
4 6 8 100.2
0.4
0.6
0.8
1
k
Figure 2: Clustering performance by newsgroup.
than for a0 a4a1a0 suggests that there is some
overlap, possibly created by a sub-topic of one
of the newsgroups. That is, although there are
only three newsgroups, four centroids are bet-
ter at finding them than three centroids, be-
cause the fourth centroid may correspond to
an overlap region between two clusters, which
then gets assigned to the correct newsgroup.
We can get a better insight into these results by
inspecting the individual performance of the pooled
clusters, particularly their precision and recall. Fig-
ure 2 shows the average performance of the pooled
clusters separately for each of the three newsgroups.
This figure confirms that the ‘hp’ newsgroup is the
least affected by signature words: for low values
of a0 , without filtering, the average performance (F-
score) of the pooled clusters corresponding to the
‘hp’ newsgroup is generally better than that of the
clusters corresponding to the other newsgroups (and
it even matches the performance achieved with fil-
tering for a0 a4a3a2 ). This is particularly evident when
we compare recall curves: recall for the ‘hp’ news-
group without filtering reaches the recall obtained
with filtering when a0 a4 a5 . In contrast, precision
only achieves this level of performance for higher
values of a0 — this is because some of the documents
in the ‘hp’ newsgroup are confused with documents
in the other two newsgroups.
4.2 Simple Information Retrieval
A desirable outcome for retrieval systems that per-
form document clustering prior to retrieval is that
the returned clusters contain as much useful infor-
mation as possible regarding a user’s query. If the
clustering is performed well, the words in the query
should appear in many documents in the best match-
ing cluster(s).
Our retrieval experiments consist of retrieving
documents that match three simple queries, each
comprising a word pair that occurs frequently in the
newsgroups. As before, for each experiment we re-
peated the clustering procedure 100 times and av-
eraged the results. Retrieval performance was mea-
sured as follows:
correct documents in the selected cluster
total correct documents in the dataset
where a correct document is one that contains all
the words in a query, and the selected cluster is that
whose centroid has the highest average value for
the query terms. That is, if a query comprises the
words a40a5a4 a9a5a6 a4 a12 a6 a42 a42 a42 a6 a4a8a7 a44 , and cluster a37 has a cen-
troid value a9a11a10a13a12
a2
for word a4 a1 , then the cluster that
best matches the query is the cluster a37a15a14 such that
a37 a14
a4a17a16a19a18 a10a21a20a22a16a24a23
a2
a40
a21
a25
a7
a26
a1a28a27
a9
a9a29a10a13a12
a2
a44
a42
Our measure for retrieval performance considers
only recall (i.e. how many correct documents were
found for a particular query). It does not have a pre-
cision component, because the system retrieves only
documents that contain all the words in the query.
That is, precision is always perfect.
According to Figure 2, the recall for the ‘hp’
newsgroup is equally high with and without filtering
when a0 a46 a5 , as opposed to the other newsgroups,
where the recall is significantly better with filtering
for all values of a0 . We therefore chose a0 a4 a5 to
evaluate retrieval, in order to expose the differences
between the newsgroups.
Table 4 shows the retrieval performance obtained
for the three queries, when clustering is performed
with and without filtering, and with a0 a4 a5 . The
table shows the average performance of the pooled
clusters separately for each of the three newsgroups.
Also shown for each query is the total number of
documents in the dataset that contain all the words
in the query. The average performance of the best-
matching cluster is displayed in bold font, and the
standard deviation appears in brackets next to the
performance.
The first query is related to the ‘hp’ newsgroup.
The retrieval performance of the matching cluster
filter hp tex photoshop
Query 1: letter backend (total 25)
off 0.87 (0.32) 0.07 (0.23) 0.06 (0.22)
on 0.93 (0.10) 0.05 (0.10) 0.02 (0.03)
Query 2: compile miktex (total 21)
off 0.20 (0.32) 0.67 (0.36) 0.13 (0.27)
on 0.00 (0.01) 0.99 (0.06) 0.01 (0.06)
Query 3: rgb colour (total 22)
off 0.20 (0.21) 0.11 (0.24) 0.69 (0.30)
on 0.15 (0.14) 0.02 (0.12) 0.83 (0.19)
Table 4: Queries used to evaluate the retrieval task.
for this query is high with and without the filter-
ing mechanism (the difference in performance is not
statistically significant). As discussed above, this
result is expected due to the similar recall score of
the pooled cluster obtained with and without filter-
ing for this newsgroup.
Filtering has a more significant effect for the
queries relating to the other newsgroups. Query 2
is very specific to the ‘tex’ newsgroup: when filter-
ing is used, almost all the relevant documents are re-
trieved by the corresponding cluster. The benefit of
filtering is very clear when we consider the poor re-
trieval performance when filtering is not used: 33%
of the documents are missed (the p-value for the
difference in retrieval score is a0 a3 a42 a3 a21 ). The third
query has more ambiguity (the word ‘colour’ ap-
pears in the ‘hp’ newsgroup), and therefore the over-
all retrieval performance is worse than for the other
queries. About 17% of the documents were missed
when filtering was used, most of which were allo-
cated to the ‘hp’ newsgroup. Nevertheless, the fil-
tering mechanism has a significant effect even for
this ambiguous query (p-value=0.03).
5 Conclusion
In this paper, we have identified features of elec-
tronic discussions that influence clustering perfor-
mance, and presented a filtering mechanism that re-
moves adverse influences. The effect of our filtering
mechanism was evaluated by means of two experi-
ments: coarse-level clustering and simple informa-
tion retrieval. Our results show that filtering out the
signature words of dominant speakers has a posi-
tive effect on clustering and retrieval performance.
Although these experiments were performed at a
coarser level of granularity than that of our target
domain, our results indicate that filtering signature
words is a promising pre-processing step for clus-
tering electronic discussions.
From a more qualitative perspective, we clearly
saw the benefit of the filtering mechanism in the ex-
ample in Section 3.3 (Tables 2 and 3): when a gen-
eration component is used to describe the contents
of clusters, the inclusion of author-specific words is
uninformative and even confusing.
Our approach to filtering is general in the sense
that we do not target specific parts of electronic dis-
cussions (e.g. the last few lines of a posting) for
filtering. We have experimented with a more naive
approach that removes all web and email addresses
from a posting (they account for a significant por-
tion of a signature). However, this simple heuris-
tic yielded only a small improvement in clustering
performance. More importantly, it clearly does not
generalise to deal with the problem of identifying
and removing author-specific terminology.
6 Acknowledgments
This research was supported in part by grant
LP0347470 from the Australian Research Council
and by an endowment from Hewlett Packard.

References

Abdelmonem Abdelaziz Afifi and Virginia Ann
Clark. 1996. Computer-Aided Multivariate
Analysis. Chapman & Hall, London.

Yuval Marom and Ingrid Zukerman. 2004. Im-
proving newsgroup clustering by filtering author-
specific words. In PRICAI’04 – Proceedings of
the 8th Pacific Rim International Conference on
Artificial Intelligence, Auckland, New Zealand.

J. L. Neto, A. D. Santos, C. A. A. Kaestner, and
A. A. Freitas. 2000. Document clustering and
text summarization. In PAKDD-2000 – Proceed-
ings of the 4th International Conference on Prac-
tical Applications of Knowledge Discovery and
Data Mining, pages 41–55, London, UK.

G. Salton and M.J. McGill. 1983. An Introduction
to Modern Information Retrieval. McGraw Hill.

Gerald Salton. 1971. Cluster search strategies and
the optimization of retrieval effectiveness. In
Gerald Salton, editor, The SMART Retrieval Sys-
tem — Experiments in Automatic Document Pro-
cessing, pages 223–242. Prentice-Hall, Inc., En-
glewood Cliffs, NJ.

Hinrich Sch¨utze and Jan O. Pedersen. 1995. Infor-
mation retrieval based on word senses. In Pro-
ceedings of the 4th Annual Symposium on Doc-
ument Analysis and Information Retrieval, pages
161–175, Las Vegas, Nevada.

Oren Zamir and Oren Etzioni. 1998. Web docu-
ment clustering: A feasibility demonstration. In
SIGIR’98 – Proceedings of the 21st ACM Inter-
national Conference on Research and Develop-
ment in Information Retrieval, pages 46–54, Mel-
bourne, Australia.
