<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1037">
  <Title>Digesting Virtual &quot;Geek&quot; Culture: The Summarization of Technical Internet Relay Chats</Title>
  <Section position="4" start_page="298" end_page="299" type="metho">
    <SectionTitle>
3 Technical Internet Relay Chats
</SectionTitle>
    <Paragraph position="0"> GNUe, a meta-project of the GNU project (one of the most famous free/open source software projects), is the case study used in (Elliott and Scacchi, 2004) in support of the claim that, even in virtual organizations, successful conflict management is still needed to maintain order and stability.</Paragraph>
    <Paragraph position="1"> The GNUe IRC archive is uniquely suited for our experimental purpose because each IRC chat log has a companion summary digest written by project participants as part of their contribution to the community. This manual summary constitutes gold-standard data for evaluation.</Paragraph>
    <Section position="1" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
3.1 Kernel Traffic
</SectionTitle>
      <Paragraph position="0"> Kernel Traffic is a collection of summary digests of discussions on GNUe development. Each digest summarizes IRC logs and/or email messages (later referred to as chat logs) for a period of up to two weeks. A nice feature is that direct quotes and hyperlinks are part of the summary. Each digest is an extractive overview of facts, plus the author's dramatic and humorous interpretations.</Paragraph>
    </Section>
    <Section position="2" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
3.2 Corpus Download
</SectionTitle>
      <Paragraph position="0"> The complete Linux Kernel Archive (LKA) consists of two separate downloads. The Kernel Traffic (summary digests) are in XML format and were downloaded by crawling the Kernel Traffic site.</Paragraph>
      <Paragraph position="1"> The Linux Kernel Archives (individual IRC chat logs) were downloaded from the archive site. We matched the summaries with their respective chat logs based on subject lines and publication dates.</Paragraph>
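The matching step can be sketched as follows; the field names and the two-week date window are illustrative assumptions, not the authors' actual implementation.

```python
from datetime import date

def normalize_subject(s):
    # Strip reply/forward prefixes and lowercase so "Re: X" matches "X".
    s = s.strip().lower()
    changed = True
    while changed:
        changed = False
        for prefix in ("re:", "fwd:", "fw:"):
            if s.startswith(prefix):
                s = s[len(prefix):].strip()
                changed = True
    return s

def match_digests_to_logs(digests, logs, max_days=14):
    """Pair each digest with the chat logs that share its subject line
    and were posted at most max_days before the digest's publication
    date (digests cover periods of up to two weeks)."""
    matches = {}
    for d in digests:
        key = normalize_subject(d["subject"])
        matches[d["subject"]] = [
            log for log in logs
            if normalize_subject(log["subject"]) == key
            and 0 <= (d["date"] - log["date"]).days <= max_days
        ]
    return matches
```

A digest dated 2002-03-14 would, for example, collect a log titled "Re: GNUe: Designer crash" from 2002-03-10 while skipping logs on other subjects.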
    </Section>
    <Section position="3" start_page="298" end_page="299" type="sub_section">
      <SectionTitle>
3.3 Observation on Chat Logs
</SectionTitle>
      <Paragraph position="0"> Upon initial examination of the chat logs, we found that many conventional assumptions about chats in general do not apply. For example, in most instant-message chats, each exchange usually consists of a small number of words in several sentences. Due to the technical nature of GNUe, half of the chat logs contain in-depth discussions with lengthy messages. One message might ask and answer several questions, discuss many topics in detail, and make further comments. This property, which we call subtopic structure, is an important difference from informal chat/interpersonal banter.</Paragraph>
      <Paragraph position="1"> Figure 1 shows the subtopic structure and relations of the first 4 messages from a chat log, produced manually. Each message is represented horizontally; the vertical arrows show where participants responded to each other. Visual inspection reveals that in this example there are three distinct clusters (a more complex cluster and two smaller satellite clusters) of discussions between participants at the sub-message level.</Paragraph>
    </Section>
    <Section position="4" start_page="299" end_page="299" type="sub_section">
      <SectionTitle>
3.4 Observation on Summary Digests
</SectionTitle>
      <Paragraph position="0"> To measure the goodness of system-produced summaries, gold standards are used as references.</Paragraph>
      <Paragraph position="1"> Human-written summaries usually make up the gold standards. The Kernel Traffic summary digests are written by Linux experts who actively contribute to the production and discussion of the open source projects. However, participant-produced digests cannot be used as reference summaries verbatim. Due to the complex structure of the dialogue, the summary itself exhibits some discourse structure, necessitating reader-guidance phrases such as &quot;for the ... question,&quot; &quot;on the ... subject,&quot; &quot;regarding ...,&quot; and &quot;later in the same thread&quot; to direct and refocus the reader's attention. Therefore, further manual editing and partitioning is needed to transform a multi-topic digest into several smaller subtopic-based gold-standard reference summaries (see Section 6.1 for the transformation).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="299" end_page="300" type="metho">
    <SectionTitle>
4 Fine-grained Clustering
</SectionTitle>
    <Paragraph position="0"> To model the subtopic structure of each chat message, we apply clustering at the sub-message level.</Paragraph>
    <Section position="1" start_page="299" end_page="299" type="sub_section">
      <SectionTitle>
4.1 Message Segmentation
</SectionTitle>
      <Paragraph position="0"> First, we look at each message and assume that each participant responds to an ongoing discussion by stating his/her opinion on several topics or issues that have been discussed in the current chat log, but not necessarily in the order they were discussed. Thus, topic shifts can occur sequentially within a message. Messages are partitioned into multi-paragraph segments using TextTiling, which reportedly has an overall precision of 83% and recall of 78% (Hearst, 1994).</Paragraph>
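The lexical-cohesion idea behind TextTiling can be sketched minimally as below. This is not Hearst's full algorithm (which compares fixed-size token sequences and smooths depth scores); the block size and threshold here are illustrative.

```python
import re
from collections import Counter
from math import sqrt

def bow(text):
    # Bag-of-words vector over lowercased alphabetic tokens.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    num = sum(a[word] * b[word] for word in a if word in b)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttile(sentences, w=2, threshold=0.1):
    """Place a topic boundary at each inter-sentence gap where the
    similarity between the w sentences on either side dips below
    `threshold` (a crude stand-in for TextTiling's depth scoring)."""
    boundaries = []
    for gap in range(w, len(sentences) - w + 1):
        left = bow(" ".join(sentences[gap - w:gap]))
        right = bow(" ".join(sentences[gap:gap + w]))
        if cosine(left, right) < threshold:
            boundaries.append(gap)
    return boundaries
```

On a message whose first two sentences discuss the scheduler and whose last two discuss an unrelated topic, the cohesion dip falls at the gap between them, yielding a two-segment partition.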
    </Section>
    <Section position="2" start_page="299" end_page="299" type="sub_section">
      <SectionTitle>
4.2 Clustering
</SectionTitle>
      <Paragraph position="0"> After obtaining a set of message segments, we cluster them. Because the number of subtopics under discussion is unknown, we cannot assume a total number of resulting clusters when choosing a clustering method. Thus, nonhierarchical partitioning methods cannot be used, and we must use a hierarchical method. These methods are either agglomerative, beginning with an unclustered data set and performing N - 1 pairwise joins, or divisive, assigning all objects to a single cluster and then performing N - 1 divisions to create a hierarchy of smaller clusters, where N is the total number of items to be clustered (Frakes and Baeza-Yates, 1992).</Paragraph>
    </Section>
    <Section position="3" start_page="299" end_page="300" type="sub_section">
      <SectionTitle>
Ward's Method
</SectionTitle>
      <Paragraph position="0"> Hierarchical agglomerative clustering methods are commonly used and we employ Ward's method (Ward and Hook, 1963), in which the text segment pair merged at each stage is the one that minimizes the increase in total within-cluster variance.</Paragraph>
      <Paragraph position="1"> Each cluster is represented by an L-dimensional centroid vector, and $m_k$ is the number of objects in cluster k. The squared Euclidean distance between two segments i and j is $d_{ij}^2 = \sum_{l=1}^{L} (x_{il} - x_{jl})^2$. When two segments are joined, the increase in within-cluster variance is $I_{ij} = \frac{m_i m_j}{m_i + m_j} d_{ij}^2$.</Paragraph>
      <Paragraph position="2"> Number of Clusters: The process of joining clusters continues until the combination of any two clusters would destabilize the entire array of currently existing clusters produced from previous stages. At each stage, the two clusters $C_i$ and $C_j$ are chosen whose combination would cause the minimum increase in variance $I_{ij}$, expressed as a percentage of the variance change from the last round. If this percentage reaches a preset threshold, the nearest two clusters are much farther from each other than in the previous round; joining them would therefore be a destabilizing change and should not take place.</Paragraph>
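The merge loop with the percentage-based stopping rule can be sketched as follows; the destabilization test (this round's variance increase relative to the last round's) is our reading of the description, and the threshold value is illustrative.

```python
def ward_cluster(points, ratio_threshold=2.0):
    """Agglomerative clustering with Ward's criterion: repeatedly merge
    the pair with the smallest variance increase I = m_i*m_j/(m_i+m_j)*d^2,
    stopping when the increase jumps past ratio_threshold times the
    previous round's increase (an assumed destabilization test)."""
    # Each cluster: (member indices, centroid, size).
    clusters = [([i], list(p), 1) for i, p in enumerate(points)]
    prev_increase = None
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (_, ca, ma), (_, cb, mb) = clusters[a], clusters[b]
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                inc = ma * mb / (ma + mb) * d2
                if best is None or inc < best[0]:
                    best = (inc, a, b)
        inc, a, b = best
        if prev_increase is not None and prev_increase > 0 \
                and inc / prev_increase > ratio_threshold:
            break  # merging now would destabilize the existing clusters
        members = clusters[a][0] + clusters[b][0]
        ma, mb = clusters[a][2], clusters[b][2]
        centroid = [(ma * x + mb * y) / (ma + mb)
                    for x, y in zip(clusters[a][1], clusters[b][1])]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((members, centroid, ma + mb))
        prev_increase = inc
    return [sorted(c[0]) for c in clusters]
```

On four points forming two tight pairs far apart, the loop merges each pair cheaply and then halts before the expensive cross-pair merge, leaving two clusters without fixing their number in advance.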
      <Paragraph position="10"> Sub-message segments from the resulting clusters are arranged in the order in which the original messages were posted, and the resulting subtopic structures are similar to the one shown in Figure 1.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="300" end_page="302" type="metho">
    <SectionTitle>
5 Summary Extraction
</SectionTitle>
    <Paragraph position="0"> Having obtained clusters of message segments focused on subtopics, we adopt the typical summarization paradigm to extract informative sentences and segments from each cluster to produce subtopic-based summaries. If a chat log has n clusters, then the corresponding summary will contain n mini-summaries.</Paragraph>
    <Paragraph position="1"> All message segments in a cluster are related to the central topic, but to various degrees. Some are answers to questions asked previously, plus further elaborative explanations; some make suggestions and give advice where they are requested, etc.</Paragraph>
    <Paragraph position="2"> From careful analysis of the LKA data, we can safely assume that for this type of conversational interaction, the goal of the participants is to seek help or advice and advance their current knowledge on various technical subjects. This kind of interaction can be modeled as one problem-initiating segment and one or more corresponding problem-solving segments. We envisage that identifying corresponding message segment pairs will produce adequate summaries. This analysis follows the structural organization of summaries from Kernel Traffic. Other types of discussions, at least in part, require different discourse/summary organization.</Paragraph>
    <Paragraph position="3"> These corresponding pairs are formally introduced below, along with the methods we experimented with for identifying them.</Paragraph>
    <Section position="1" start_page="300" end_page="300" type="sub_section">
      <SectionTitle>
5.1 Adjacent Response Pairs
</SectionTitle>
      <Paragraph position="0"> An important conversational analysis concept, adjacent pairs (AP), is applied in our system to identify initiating and responding correspondences from different participants in one chat log. Adjacent pairs are considered fundamental units of conversational organization (Schegloff and Sacks, 1973). An adjacent pair is said to consist of two parts that are ordered, adjacent, and produced by different speakers (Galley et al., 2004). In our email/chat (LKA) corpus, a physically adjacent message, following the timeline, may not directly respond to its immediate predecessor: discussion participants read the current live thread and decide what they would like to respond to, not necessarily in serial fashion. The subtopic structure (see Figure 1) violates the adjacency requirement further. Because of this, a relaxed adjacency requirement has been used extensively in conversational analysis research (Levinson, 1983), and we adopt it here.</Paragraph>
      <Paragraph position="1"> Information produced by adjacent correspondences can be used to produce the subtopic-based summary of the chat log. As described in Section 4, each chat log is partitioned, at sub-message level, into several subtopic clusters. We take the message segment that appears first chronologically in the cluster as the topic-initiating segment in an adjacent pair. Given the initiating segment, we need to identify one or more segments from the same cluster that are the most direct and relevant responses. This process can be viewed equivalently as the informative sentence extraction process in conventional text-based summarization.</Paragraph>
    </Section>
    <Section position="2" start_page="300" end_page="301" type="sub_section">
      <SectionTitle>
5.2 AP Corpus and Baseline
</SectionTitle>
      <Paragraph position="0"> We manually tagged 100 chat logs for adjacent pairs. There are, on average, 11 messages per chat log and 3 segments per message (considerably larger than the threads used in previous research). Each chat log was clustered into one or more bags of message segments. The message segment that appears earliest in time in a cluster was marked as the initiating segment. The annotators were shown this segment together with one other segment at a time, and asked to decide whether the other segment directly responds to the initiating segment, e.g., as a direct answer to the question asked or the suggestion that was requested. In total there are 1521 adjacent response pairs; 1000 were used for training and 521 for testing.</Paragraph>
      <Paragraph position="1"> Our baseline system selects the message segment (from a different author) immediately following the initiating segment. It is quite effective, with an accuracy of 64.67%. This is reasonable because not all adjacent responses are interrupted by messages responding to different earlier initiating messages.</Paragraph>
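The baseline reduces to a few lines; the (author, text) segment representation here is assumed for illustration.

```python
def baseline_response(segments, init_index):
    """Return the index of the first segment after the initiating one
    whose author differs from the initiator (the baseline described
    in the text); segments is a list of (author, text) tuples."""
    init_author = segments[init_index][0]
    for i in range(init_index + 1, len(segments)):
        if segments[i][0] != init_author:
            return i
    return None  # nobody responded
```

Given a thread where the initiator posts twice before anyone else replies, the baseline skips the initiator's own follow-up and selects the first segment from another author.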
      <Paragraph position="2"> In the following sections, we describe two machine learning methods that were used to identify the second element in an adjacent response pair and the features used for training. We view the problem as a binary classification problem, distinguishing less relevant responses from direct responses. Our approach is to assign a candidate message segment c an appropriate response class r.</Paragraph>
    </Section>
    <Section position="3" start_page="301" end_page="301" type="sub_section">
      <SectionTitle>
5.3 Features
</SectionTitle>
      <Paragraph position="0"> Structural and durational features have been shown to improve performance significantly in conversational text analysis tasks. Using them, Galley et al. (2004) report an 8% increase in speaker identification accuracy. Zechner (2001) reports excellent results (F &gt; .94) for inter-turn sentence boundary detection when recording the length of pauses between utterances. In our corpus, durational information is unavailable because chats and emails were mixed and no exact timestamps other than dates were recorded, so we rely solely on structural and lexical features.</Paragraph>
      <Paragraph position="1"> For structural features, we count the number of messages between the initiating message segment and the responding message segment. Lexical features are listed in Table 1. The tech words are the words that are uncommon in conventional literature and unique to Linux discussions.</Paragraph>
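A feature extractor along these lines might look as follows; the dictionary keys and the tech-word lexicon are hypothetical stand-ins for the paper's actual feature set (Table 1).

```python
import re

# Hypothetical tech-word lexicon; the paper derives one from words
# unique to Linux discussions.
TECH_WORDS = {"kernel", "patch", "scheduler", "mutex", "syscall"}

def extract_features(init, cand):
    """Structural + lexical features for an (initiating, candidate)
    segment pair; each segment is a dict with 'msg_index' and 'text'
    keys (assumed representation)."""
    init_tokens = set(re.findall(r"[a-z]+", init["text"].lower()))
    cand_tokens = set(re.findall(r"[a-z]+", cand["text"].lower()))
    overlap = init_tokens.intersection(cand_tokens)
    return {
        # structural: how many messages separate the pair
        "msg_distance": cand["msg_index"] - init["msg_index"],
        # lexical: shared vocabulary, shared tech words, surface cues
        "word_overlap": len(overlap),
        "tech_overlap": len(overlap.intersection(TECH_WORDS)),
        "cand_has_question": int("?" in cand["text"]),
    }
```

Each (initiating, candidate) pair thus becomes a small numeric vector that either learner below can consume.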
    </Section>
    <Section position="4" start_page="301" end_page="301" type="sub_section">
      <SectionTitle>
5.4 Maximum Entropy
</SectionTitle>
      <Paragraph position="0"> Maximum entropy has proven to be an effective method in various natural language processing applications (Berger et al., 1996). For training and testing, we used YASMET.</Paragraph>
      <Paragraph position="1"> The probability of a response class r given a candidate message segment c is $p(r|c) = \frac{1}{Z(c)} \exp(\sum_i \lambda_i f_i(c, r))$, where $Z(c)$ is a normalizing constant and the feature function for feature $f_i$ and response class r is defined as $f_i(c, r) = 1$ if the candidate segment contains feature $f_i$ and its class is r, and 0 otherwise. The weight $\lambda_i$ is large when $f_i$ is strong evidence for response class r. Then, to determine the best class $r^*$ for the candidate message segment c, we have $r^* = \arg\max_r p(r|c)$.</Paragraph>
    </Section>
    <Section position="5" start_page="301" end_page="302" type="sub_section">
      <SectionTitle>
5.5 Support Vector Machine
</SectionTitle>
      <Paragraph position="0"> Support vector machines (SVMs) have been shown to outperform other existing methods (naive Bayes, k-NN, and decision trees) in text categorization (Joachims, 1998). Their advantages are robustness and the elimination of the need for feature selection and parameter tuning. SVMs find the hyperplane that separates the positive and negative training examples with maximum margin. Finding this hyperplane can be translated into the optimization problem of finding the coefficients $\alpha_i$ of the weight vector $w = \sum_i \alpha_i y_i x_i$, where the $x_i$ are the training examples and $y_i \in \{-1, +1\}$ their class labels. Test data are classified depending on the side of the hyperplane they fall on. We used the</Paragraph>
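The resulting linear decision function can be sketched as below, assuming the coefficients and bias have already been learned by the optimizer.

```python
def svm_decision(x, support_vectors, labels, alphas, b):
    """Linear SVM prediction: sign(w . x + b), where the weight vector
    is w = sum_i alpha_i * y_i * x_i over the support vectors."""
    w = [sum(a * y * sv[d]
             for a, y, sv in zip(alphas, labels, support_vectors))
         for d in range(len(x))]
    score = sum(wd * xd for wd, xd in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

A point on the positive side of the hyperplane is labeled +1, one on the negative side -1; only the sign of the score matters for classification.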
    </Section>
  </Section>
  <Section position="7" start_page="302" end_page="302" type="metho">
    <SectionTitle>
5.6 Results
</SectionTitle>
    <Paragraph position="0"> Entries in Table 2 show the accuracies achieved using machine learning models and feature sets.</Paragraph>
    <Section position="1" start_page="302" end_page="302" type="sub_section">
      <SectionTitle>
5.7 Summary Generation
</SectionTitle>
      <Paragraph position="0"> After responding message segments are identified, we couple them with their respective initiating segment to form a mini-summary for their subtopic. Each initiating segment has zero or more responding segments; we also observed zero-response cases in human-written summaries, where a participant initiated a question or concern but others failed to follow up on the discussion. The AP process is repeated for each cluster created previously. One or more subtopic-based mini-summaries make up the final summary for each chat log. Figure 2 shows an example. For longer chat logs, the length of the final summary is arbitrarily capped at 35% of the original.</Paragraph>
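The assembly step can be sketched as follows; the segment dictionary keys, the classifier interface, and the greedy realization of the 35% cap are assumptions for illustration.

```python
def build_summary(clusters, is_response, max_ratio=0.35):
    """One mini-summary per subtopic cluster: the chronologically first
    segment initiates; segments accepted by the (hypothetical)
    classifier is_response(init, seg) respond. Mini-summaries are then
    kept greedily until the length cap is reached."""
    total_len = sum(len(s["text"]) for c in clusters for s in c)
    minis = []
    for cluster in clusters:
        ordered = sorted(cluster, key=lambda s: s["time"])
        init, rest = ordered[0], ordered[1:]
        minis.append({"initiating": init["text"],
                      "responses": [s["text"] for s in rest
                                    if is_response(init, s)]})
    summary, used = [], 0
    for mini in minis:
        size = len(mini["initiating"]) + sum(len(t) for t in mini["responses"])
        if summary and used + size > max_ratio * total_len:
            break  # length cap reached for long logs
        summary.append(mini)
        used += size
    return summary
```

Note that a cluster whose initiating question drew no accepted responses still yields a mini-summary with an empty response list, mirroring the zero-response cases observed in the human digests.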
    </Section>
  </Section>
</Paper>