File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/c04-1068_intro.xml
Size: 6,009 bytes
Last Modified: 2025-10-06 14:02:04
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1068"> <Title>Filtering Speaker-Specific Words from Electronic Discussions</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The ability to draw on past experience is often useful in information-providing applications. For instance, users who interact with help-desk applications would benefit from the availability of relevant contextual information about their request, e.g., from previous, similar interactions between the system and other users, or from interactions between domain experts.</Paragraph> <Paragraph position="1"> The work reported in this paper is the first step in a project which aims to provide such information. The eventual objective of our project is to automatically identify related interactions in help-desk applications, and to generate summaries from their combined experience. These summaries would then assist both users and operators.</Paragraph> <Paragraph position="2"> Our approach to the identification of related interactions hinges on the application of clustering techniques. These techniques have been used in Information Retrieval for some time (e.g. Salton, 1971). They involve grouping a set of related documents, and then using a representative element to match input queries (as opposed to matching the whole collection of documents). Document clustering has been used in search engine applications to improve and speed up retrieval (e.g. Zamir and Etzioni, 1998), but also for more descriptive purposes, such as using representative elements of a cluster to generate lists of keywords (Neto et al., 2000).</Paragraph> <Paragraph position="3"> However, discussions (and dialogues in general) have distinguishing features which make clustering a corpus of such interactions a more challenging task than clustering plain documents. 
These features are: (1) the corpus consists of contributions made by a community of authors, or &quot;speakers&quot;; (2) certain speakers are more dominant in the corpus; and (3) speakers often use idiosyncratic, speaker-specific language, or make comments that are not about the task at hand.</Paragraph> <Paragraph position="4"> In this paper, we report on a preliminary study where we cluster discussions carried out in electronic newsgroups. Specifically, we report on the influence of the above features on the clustering process, and describe a filtering mechanism that identifies and removes undesirable influences.</Paragraph> <Paragraph position="5"> Table 1 shows the newsgroups used as data sources in our experiments. These newsgroups were obtained from the Internet. The table shows the number of threads in each newsgroup, the number of people posting to the newsgroups, and the highest number of postings by an individual for each newsgroup. It also shows the impact of the filtering mechanism on each newsgroup (Section 2).</Paragraph> <Paragraph position="6"> The clustering process and filtering mechanism were evaluated by means of two experiments: (1) coarse-level clustering, and (2) simple information retrieval.</Paragraph> <Paragraph position="7"> Coarse-level clustering. This experiment consists of merging the discussion threads (documents) in different newsgroups into a single dataset, and applying a clustering mechanism to separate them.</Paragraph> <Paragraph position="8"> The performance of the clustering mechanism is then evaluated by how well the generated clusters match the original newsgroups from which the discussion threads were obtained. Clearly, this evaluation is at a coarser level of granularity than that required for our final system. 
However, we find it useful for the following reasons: • Owing to the number and diversity of newsgroups on the Internet, we can perform controlled experiments where we vary the degree of similarity between newsgroups, thereby simulating discussions with different levels of relatedness.</Paragraph> <Paragraph position="9"> • Our experiments show that our filtering mechanism has a positive influence at different levels of granularity (Section 4). Hence, there is reason to expect that this influence will remain for finer levels of granularity, e.g., the level of a task or request.</Paragraph> <Paragraph position="10"> • Finally, the different newsgroups are identified in advance, which obviates the need for manual discussion-tagging at this stage.</Paragraph> <Paragraph position="11"> Due to space limitations, we report only on a subset of our experiments. In Marom and Zukerman (2004), we present a comparative study that considers different sets of newsgroups of varying levels of relatedness. We regard the set of newsgroups presented here as having a &quot;medium&quot; level of relatedness. Simple information retrieval. This experiment constitutes a simplistic and restricted version of the document retrieval functionality envisaged for our eventual system. In this experiment, we matched pairs of query terms to the centroids of the generated clusters, and assessed the system's ability to retrieve relevant discussion threads from the best-matching cluster, with and without filtering. The experiment makes the implicit assumption that the corpus contains discussions relevant to incoming requests, i.e., that new requests are similar to old ones. 
We believe that the results of this restricted experiment are indicative of future system performance, as the envisaged system is also expected to operate under this assumption.</Paragraph> <Paragraph position="12"> Next, we describe our filtering mechanism. Section 3 describes the clustering procedure, including our data representation and cluster identification method. Section 4 presents the results from our experiments, and Section 5 concludes the paper.</Paragraph> </Section></Paper>