<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0409"> <Title>Multi-document Summarization by Visualizing Topical Content</Title> <Section position="4" start_page="83" end_page="87" type="intro"> <SectionTitle> 4 Implementation </SectionTitle> <Paragraph position="0"> In this section, we describe the implementation of our prototype system. The overall process flow of this system is shown in Figure 3. Our description omits the process Of creating graphical presentation that is straightforwardly understood from Section 3. The system takes, as its input, the text of a given set of documents. Throughout this section, we use the three small 'documents' shown below as an illustrative example. The data flow from these three documents to the final output is shown in Figure 4.</Paragraph> <Section position="1" start_page="83" end_page="85" type="sub_section"> <SectionTitle> 4.1 Term extraction </SectionTitle> <Paragraph position="0"> First, we extract all terms contained in the documents, using an infrastructure for document processing and analysis, comprising a number of interconnected, and mutually enabling, linguistic filters; which operates without any reference to a pre-defined domain. The whole infrastructure (hereafter referred to as TEXTRACT) is designed from the ground up to perform a variety of linguistic feature extraction functions, ranging from straightforward, single pass, tokenization, lexical look-up and morphological analysis, to complex aggregation of representative (salient) phrasal units across large multi-document collections (Boguraev and Neff, 2000).</Paragraph> <Paragraph position="1"> TEXTRACT combines functions for linguistic analysis, filtering, and normalization; these focus on morphological processing, named entity identification, technical terminology extraction, and other multi-word phrasal analysis; and are further enhanced by cross-document aggregation, resulting in some nor-</Paragraph> <Paragraph position="3"> document text # 1 : Mary Jones has a little lamb. (sl) The lamb is her good buddy. (s2) #2: Mary Jones is a veterinarian for ABC University. (s3) ABC University has many lambs. (s4) #3: Mike Smith is a programmer for XYZ corporation. (sS) term.document vectors Input: term-document vectors dr, ..., dn Output: conversion matrix C</Paragraph> <Paragraph position="5"> malization to canonical forms, and simple types of co-reference resolution.</Paragraph> <Paragraph position="6"> For the example mini-documents above, after removal of common stop words, the terms remaining as linguistic objects for the algorithm to operate on are listed at top of Figure 4.</Paragraph> </Section> <Section position="2" start_page="85" end_page="85" type="sub_section"> <SectionTitle> 4.2 Vector creation </SectionTitle> <Paragraph position="0"> We construct the semantic space from term-document relationships by a procedure 2 shown in Figure 5. In the semantic space, each of vector elements represents a linear combination of terms. The conversion matrix returned by the semantic space creation procedure keeps the information of these linear combinations. For instance, the conversion matrix for our example (see Figure 4) shows that the first element of a vector in the semantic space is associated with 0.45,&quot;Mary Jones&quot;+0.22*&quot;iittle&quot;+0.67*&quot;lamb&quot;+0.22,. &quot;good buddy&quot;+0.22,&quot;veterinarian&quot;+0.45,&quot;ABC University&quot;. 
<Paragraph position="1"> To map the documents to vectors in the semantic space, we create term-document vectors, each of whose elements represents the degree of relevance of a term to the document. Our implementation uses term frequency as the degree of relevance. We create document vectors in the semantic space by multiplying the term-document vectors by the conversion matrix. Sentences and terms can also be mapped to vectors in the same way, by treating them as "small documents".</Paragraph> <Paragraph position="2"> [2] We do not describe the details of this procedure in this paper; see Section 2.</Paragraph> </Section> <Section position="3" start_page="85" end_page="86" type="sub_section"> <SectionTitle> 4.3 Identifying topics </SectionTitle> <Paragraph position="0"> Ultimately, our multi-document summaries rely crucially on identifying topics representing all the documents in the set. This is done by creating topic vectors so that each document vector is close to (i.e., represented by) at least one topic vector. We implement this topic vector creation process as follows.</Paragraph> <Paragraph position="1"> First, we create a document graph from the document vectors. In the document graph, each node represents a document vector, and two nodes have an edge between them if and only if the similarity between the two document vectors is above a threshold. Next, we detect the connected components in the document graph, and we create the topic vectors from each connected component by applying the procedure 'DetectTopic' (Figure 6; Input: a set of document vectors S, Output: topic vectors) recursively.</Paragraph> <Paragraph position="2"> 'DetectTopic' works as follows. The unit eigenvector of a covariance matrix of the document vectors in a set S is computed as v. It is a representative direction of the document vectors in S. If the similarity between v and any document vector in S is below a threshold, then S is divided into two sets S1 and S2 (as in Figure 7), and the procedure is called for S1 and S2 recursively. Otherwise, v is returned as a topic vector. The granularity of topic detection can be adjusted by setting the threshold parameters.</Paragraph> <Paragraph position="3"> Note that such a topic vector creation procedure essentially detects "cluster centroids" of document vectors (not sentence vectors), although grouping documents into clusters is not our purpose. This indicates that general vector-based clustering technologies could be integrated into our framework, if doing so brings further improvement.</Paragraph> </Section> <Section position="4" start_page="86" end_page="86" type="sub_section"> <SectionTitle> 4.4 Associations between topics and linguistic objects </SectionTitle> <Paragraph position="0"> The associations between topics and linguistic objects (documents, sentences, and terms) are measured by computing the cosine (similarity measurement) between the topic vectors and the linguistic object vectors. The degree of association between topics and documents is used to create document maps.</Paragraph>
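A hedged sketch of the 'DetectTopic' recursion and the cosine associations follows. The eigenvector step and the similarity test mirror the description above; the rule for splitting S into S1 and S2 (about the centroid, along the representative direction) is an assumed heuristic, since the paper only points to Figure 7 for it, and the threshold value is illustrative.

```python
# Hedged sketch of Sections 4.3-4.4: the 'DetectTopic' recursion and cosine
# associations. The S1/S2 split rule and the threshold are assumptions.
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def detect_topic(S, threshold=0.5):
    if len(S) == 1:                      # a single vector is its own topic
        return [S[0] / np.linalg.norm(S[0])]
    # v = unit eigenvector of the covariance matrix of the vectors in S:
    # a representative direction of the document vectors in S.
    cov = np.cov(np.stack(S), rowvar=False)
    w, V = np.linalg.eigh(cov)
    v = V[:, np.argmax(w)]
    if np.mean(S, axis=0) @ v < 0:
        v = -v                           # orient v toward the data
    if any(cosine(v, d) < threshold for d in S):
        # v fails to represent some document vector: split S and recurse.
        mean = np.mean(S, axis=0)
        S1 = [d for d in S if (d - mean) @ v >= 0]   # assumed split rule
        S2 = [d for d in S if (d - mean) @ v < 0]
        if S1 and S2:
            return detect_topic(S1, threshold) + detect_topic(S2, threshold)
    return [v]                           # v is close to every vector in S

def associations(topic_vecs, object_vecs):
    # Section 4.4: cosine between topic vectors and the vectors of
    # linguistic objects (documents, sentences, or terms).
    return [[cosine(t, o) for o in object_vecs] for t in topic_vecs]
```

In the full system, detect_topic would be applied to each connected component of the document graph; raising the threshold yields finer-grained topics, matching the granularity remark above.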
<Paragraph position="1"> The terms and sentences with the strongest associations are chosen to be the topic terms and the topic sentences, respectively.</Paragraph> <Paragraph position="2"> As a result, for our example we get the output shown at the bottom of Figure 4.</Paragraph> </Section> <Section position="5" start_page="86" end_page="87" type="sub_section"> <SectionTitle> 4.5 Computational complexity </SectionTitle> <Paragraph position="0"> Let m be the number of different terms in the document set (typically around 5000), and let n be the number of documents (typically 50 to 100) [3]. Given that m >> n, the semantic space is constructed in O(mn^2) time. The topic vectors are created in O(n^3) time by using a separator tree for the computation of all-pairs minimum cut [4], assuming that the document vector set is divided evenly [5]. Let k be the dimensionality of the semantic space, and let h be the number of detected topics. Note that k and h are at most n, but are generally much smaller than n in practice. Regarding the number of terms contained in one sentence as a constant, topic sentences are extracted in O(skh) time, where s is the total number of sentences in the document set. Topic terms are extracted in O(mkh) time. We note that the prototype system runs efficiently enough for an interactive system.</Paragraph>
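As a rough illustrative calculation (our numbers, not measurements from the paper): taking m = 5000 and n = 100 from the typical values above, constructing the semantic space costs on the order of mn^2 = 5000 * 100^2 = 5 * 10^7 basic operations, and topic vector creation on the order of n^3 = 10^6. With, say, k = h = 10 and s = 2000 sentences, topic sentence extraction is on the order of skh = 2 * 10^5 operations and topic term extraction mkh = 5 * 10^5. All of these counts are small by modern standards, which is consistent with the closing claim of interactive speed.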
5 Conclusion and further work <Paragraph position="1"> This paper proposes a framework for multiple document summarization that leverages graphical elements to present a summary as a 'constellation' of topical highlights. In this framework, we detect topics underlying a given document collection, and we describe the topics by extracting related terms and sentences from the document text. Relationships among topics and documents are graphically presented using gradation of color and placement of image objects. We illustrate interactions with [...]. The framework presented here derives its strength in equal part from two components: the results of topical analysis of the document collection are displayed by means of a multi-perspective graphical interface specifically designed to highlight this analysis. Within such a philosophy for multi-document summarization, sub-components of the analysis technology can be modularly swapped in and replaced, without contradicting the overall approach.</Paragraph> <Paragraph position="2"> The algorithms and subsystems comprising the document collection analysis component have been implemented and are fully operational. The paper described one possible interface, focusing on certain visual metaphors for highlighting collection topics. As this is work in progress, we plan to experiment with alternative presentation metaphors. We plan to carry out user studies, to evaluate the interface in general, and to determine the optimal features, best suited to representing our linguistic object analysis and supporting navigation through query results.</Paragraph> <Paragraph position="3"> Other future work will focus on determining the effects of analyzing linguistic objects at different levels of granularity on the overall results. Questions to consider here, for instance, would be: what is the optimal definition of a term for this application; does it make sense to include larger phrasal units in the semantic space; or do operations over sentences, such as sentence merging or reduction, offer alternative ways of visualizing topical content?</Paragraph> <Paragraph position="4"> It is therefore worthwhile investigating whether combining automatic summarization with intelligent multimedia presentation techniques can make briefing generation amenable to full automation. In other words, the author should be able to use a computer program to generate an initial briefing, which she can then edit and revise as needed. The briefing can then be presented by the author if desired, or else directly by the computer (particularly useful if the briefing is being sent to someone else). The starting point for this process would be a high-level outline of the briefing on the part of the author. The outline would include references to particular information sources that had to be summarized in particular ways. If a program were able to take such outlines and generate briefings which did not require extensive post-editing to massage into a state deemed acceptable for the task at hand, the program could be regarded as a worthwhile time-saving tool.</Paragraph> </Section> </Section> </Paper>