<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1027"> <Title>Learning to Detect Conversation Focus of Threaded Discussions</Title> <Section position="4" start_page="209" end_page="209" type="metho"> <SectionTitle> 3 Conversation Focus Detection </SectionTitle> <Paragraph position="0"> In threaded discussions, people participate in a conversation by posting messages. Our goal is to detect which message in a thread contains the most important information, i.e., the focus of the conversation. Unlike traditional IR systems, which return a ranked list of messages from a flat document set, our task must take into account the characteristics of threaded discussions.</Paragraph> <Paragraph position="1"> First, messages play certain roles and are related to each other by a conversation context. Second, messages written by different authors may vary in value. Finally, since postings are made in parallel by various people, message threads are not necessarily coherent, so the lexical similarity among the messages should be analyzed. To detect the focus of conversation, we integrate a pragmatics study of conversational speech acts, an analysis of message values based on poster trustworthiness, and an analysis of lexical similarity. The subsystems that determine these three sources of evidence comprise the features of our feature-based system.</Paragraph> <Paragraph position="2"> Because each discussion thread is naturally represented by a directed graph, where each message is represented by a node in the graph, we can apply a graph-based algorithm to integrate these sources and detect the focus of conversation.</Paragraph> <Section position="1" start_page="209" end_page="209" type="sub_section"> <SectionTitle> 3.1 Thread Representation </SectionTitle> <Paragraph position="0"> A discussion thread consists of a set of messages posted in chronological order. 
Suppose that each message is represented by $m_i$, $i = 1, \ldots, n$. Then the entire thread is a directed graph that can be represented by $G = (V, E)$, where $V$ is the set of nodes, $V = \{m_i, i = 1, \ldots, n\}$, and $E$ is the set of directed edges. In our approach, the set $V$ is automatically constructed as each message joins the discussion. $E$ is a subset of $V \times V$. We will discuss the feature-oriented link generation functions that construct the set $E$ in Section 4.</Paragraph> <Paragraph position="5"> We make use of speech act relations in generating the links. Once a speech act relation is identified between two messages, links are generated using the generation functions described in the next section. When m</Paragraph> <Paragraph position="7"> points to (i.e., children of m</Paragraph> <Paragraph position="9"/> </Section> <Section position="2" start_page="209" end_page="209" type="sub_section"> <SectionTitle> 3.2 Graph-Based Ranking Algorithm: HITS </SectionTitle> <Paragraph position="0"> Graph-based algorithms can rank a set of objects collectively, and the effect between each pair can be propagated through the whole graph iteratively.</Paragraph> <Paragraph position="1"> Here, we use a weighted HITS (Kleinberg, 1999) algorithm to conduct message ranking.</Paragraph> <Paragraph position="2"> Kleinberg (1999) initially proposed the graph-based algorithm HITS for ranking a set of web pages. Here, we adjust the algorithm for the task of ranking a set of messages in a threaded discussion.</Paragraph> <Paragraph position="3"> In this algorithm, each message in the graph can be represented by two identity scores, a hub score and an authority score. The hub score represents the quality of the message as a pointer to valuable or useful messages (or resources, in general). The authority score measures the quality of the message as a resource itself. 
The weighted iterative updating computations are shown in Equations 1 and 2.</Paragraph> <Paragraph position="5"> where $r$ and $r+1$ are the numbers of iterations.</Paragraph> <Paragraph position="6"> The number of iterations required for HITS to converge depends on the initialization value for each message node and the complexity of the graph. Graph links can be induced with extra knowledge (e.g., Kurland and Lee, 2005). To help integrate our heterogeneous sources of evidence with our graph-based HITS algorithm, we introduce link generation functions for each of the three features ($g_i$, $i = 1, 2, 3$) to add links between messages.</Paragraph> </Section> </Section> <Section position="5" start_page="209" end_page="211" type="metho"> <SectionTitle> 4 Feature-Oriented Link Generation </SectionTitle> <Paragraph position="0"> Conversation structures have received a lot of attention in the linguistic research community (Levinson, 1983). In order to integrate conversational features into our computational model, we must convert a qualitative analysis into quantitative scores. For conversation analysis, we adopted the theory of Speech Acts proposed by Austin (1962) and Searle (1969) and defined a set of speech acts (SAs) that relate every pair of messages in the corpus.</Paragraph> <Paragraph position="1"> Though a pair of messages may only be labeled with one speech act, a message can have multiple SAs with other messages.</Paragraph> <Paragraph position="2"> We group speech acts by function into three categories, as shown in Figure 1. Messages may involve a request (REQ), provide information (INF), or fall into the category of interpersonal (INTP) relationship. Categories can be further divided into several single speech acts.</Paragraph> <Paragraph position="3"> The SA set for our corpus is given in Table 1. A speech act may represent a positive, negative, or neutral response to a previous message depending on its attitude and recommendation. 
We classify the direction of each speech act as POSITIVE (+), NEGATIVE (−), or NEUTRAL, referred to as SA Direction, as shown in the right column of Table 1. The features we wish to include in our approach are lexical similarity between messages, poster trustworthiness, and speech act labels between message pairs in our discussion corpus.</Paragraph> <Paragraph position="4"> The feature-oriented link generation is conducted in two steps. First, our approach examines in turn all the speech act relations in each thread and generates two types of links based on lexical similarity and SA strength scores. Second, the system iterates over all the message nodes and assigns each node a self-pointing link associated with its poster trustworthiness score. The three features are integrated into the thread graph accordingly by the feature-oriented link generation functions. Multiple links with the same start and end points are combined into one.</Paragraph> <Section position="1" start_page="210" end_page="211" type="sub_section"> <SectionTitle> 4.1 Lexical Similarity </SectionTitle> <Paragraph position="0"> Discussions are constructed as people express ideas, opinions, and thoughts, so the text itself contains information about what is being discussed. Lexical similarity is an important measure for distinguishing relationships between message pairs. In our approach, we do not compute the lexical similarity of arbitrary pairs of messages; instead, we consider only message pairs that are present in the speech act set. The cosine similarity between each message pair is computed using the TF*IDF technique (Salton, 1989).</Paragraph> <Paragraph position="1"> Messages with similar words are more likely to be semantically related. This information is represented by term frequency (TF). However, messages with more general terms may be unintentionally biased when only TF is considered, so Inverse Document Frequency (IDF) is introduced to mitigate the bias. 
The lexical similarity score of a message pair can be calculated as their cosine similarity.</Paragraph> <Paragraph position="3"> For a given speech act, SA</Paragraph> <Paragraph position="5"> The newly generated link is added to the thread graph connecting message node m</Paragraph> <Paragraph position="7"/> </Section> <Section position="2" start_page="211" end_page="211" type="sub_section"> <SectionTitle> 4.2 Poster Trustworthiness </SectionTitle> <Paragraph position="0"> Messages posted by different people may have different degrees of trustworthiness. For example, students who contributed to our corpus did not seem to provide messages of equal value. To determine the trustworthiness of a person, we studied the responses to their messages throughout the entire corpus. We used the percentage of POSITIVE responses to a person's messages to measure that person's trustworthiness. In our case, POSITIVE responses, which are defined above, included SUP, COMP, and ACK. In addition, if a person's message closed a discussion, we rated it POSITIVE.</Paragraph> <Paragraph position="1"> Suppose the poster is represented by $\mathit{person}_k$, the poster score,</Paragraph> <Paragraph position="3"> For a given single speech act, SA</Paragraph> <Paragraph position="5"> The generated link is self-pointing and contains the strength of the poster information.</Paragraph> </Section> <Section position="3" start_page="211" end_page="211" type="sub_section"> <SectionTitle> 4.3 Speech Act Analysis </SectionTitle> <Paragraph position="0"> We compute the strength of each speech act in a generative way, based on the author and the author's trustworthiness. The strength of a speech act is a weighted average over all authors.</Paragraph> <Paragraph position="2"> where the sign function of direction is defined with and projected to [0, 1]. 
For a given speech act,</Paragraph> <Paragraph position="4"> ), the generation function will generate a weighted link in the thread graph as expressed in</Paragraph> <Paragraph position="6"> The SA scores represent the strength of the relationship between the messages. Depending on the direction of the SA, the generated link will either go from message m</Paragraph> <Paragraph position="8"> (i.e., to itself). If the SA is NEUTRAL, the link will point to itself and the score is a recommendation to itself. Otherwise, the link connects two different messages and represents the recommendation degree of the parent to the child message.</Paragraph> </Section> </Section> </Paper>