<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0701"> <Title>Dimensionality Reduction Aids Term Co-Occurrence Based Multi-Document Summarization</Title>
<Section position="4" start_page="0" end_page="1" type="metho"> <SectionTitle> 2 Representing Sentence Semantics </SectionTitle>
<Paragraph position="0"> The following three subsections discuss various ways of representing sentence meaning for information extraction purposes. While the first approach relies solely on weighted term frequencies in a vector space, the subsequent methods attempt to use term context information to better represent the meanings of sentences.</Paragraph>
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 Terms and Term Weighting (TF.IDF) </SectionTitle>
<Paragraph position="0"> The traditional model for measuring semantic similarity in information retrieval and text mining is based on a vector representation of the distribution of terms in documents. Within the vector space model, each term is assigned a weight which signifies the semantic importance of the term. Often, tf.idf is used for this weight, a scheme that combines the importance of a term within the current document and the distribution of the term across the text collection. The former is often represented by the term frequency and the latter by the inverse document frequency (idf_i = N / df_i), where N is the number of documents and df_i is the number of documents containing term t_i.</Paragraph>
(Footnote, truncated in source: "...sation or the context of an entity pair in relation discovery.")
</Section>
<Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Term Co-occurrence (DS) </SectionTitle>
<Paragraph position="0"> Another approach eschews the traditional vector space model in favour of the distributional semantics (DS) approach. The DS model is based on the intuition that two words are semantically similar if they appear in a similar set of contexts. We can obtain a representation of a document's semantics by averaging the context vectors of the document terms. (See Besançon et al. (1999), where the DS model is contrasted with a term × document vector space representation.)</Paragraph>
</Section>
<Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.3 Singular Value Decomposition (DS+SVD) </SectionTitle>
<Paragraph position="0"> Our third approach uses dimensionality reduction. Singular value decomposition is a technique for dimensionality reduction that has been used extensively for the analysis of lexical semantics under the name of latent semantic analysis (Landauer et al., 1998). Here, a rectangular (e.g., term × document) matrix is decomposed into the product of three matrices (X_{w×p} = W_{w×n} S_{n×n} (P_{p×n})^T) with n 'latent semantic' dimensions. W and P represent terms and documents in the new space, and S is a diagonal matrix of singular values in decreasing order.</Paragraph>
<Paragraph position="2"> Taking the product W_{w×k} S_{k×k} (P_{p×k})^T over the first k columns gives the best least-squares approximation of the original matrix X by a matrix of rank k, i.e. a reduction of the original matrix to k dimensions. Similarity between documents can then be computed in the space obtained by taking the rank-k product of S and P.</Paragraph>
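To make the DS (Section 2.2) and DS+SVD representations concrete, the following is a minimal sketch of the idea rather than the implementation used in this paper; the toy corpus, the ±2-word co-occurrence window, and the choice of k are illustrative assumptions, and only numpy is required.

    import numpy as np

    # Toy corpus: each "document" here is a tokenised sentence (illustrative only).
    corpus = [
        "the court heard the appeal".split(),
        "the judge dismissed the appeal".split(),
        "rain delayed the cricket match".split(),
    ]
    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}

    # DS model: term-by-term co-occurrence counts within a +/-2 word window.
    C = np.zeros((len(vocab), len(vocab)))
    window = 2
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    C[index[w], index[sent[j]]] += 1

    # A sentence (document) is represented as the average of its terms' context vectors.
    def ds_vector(sent):
        return np.mean([C[index[w]] for w in sent], axis=0)

    # DS+SVD: decompose C = W S P^T and keep only the first k latent dimensions.
    W, s, Pt = np.linalg.svd(C)
    k = 2  # illustrative; the experiments below use 100 dimensions
    project = lambda v: v @ W[:, :k]  # map a context vector into the reduced space

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    s1, s2 = corpus[0], corpus[1]
    print("DS similarity:    ", cosine(ds_vector(s1), ds_vector(s2)))
    print("DS+SVD similarity:", cosine(project(ds_vector(s1)), project(ds_vector(s2))))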
<Paragraph position="3"> This decomposition abstracts away from terms and can be used to model a semantic similarity that is more linguistic in nature. Furthermore, it has been used successfully to model human intuitions about meaning. For example, Landauer et al. (1998) show that latent semantic analysis correlates well with human judgements of word similarity, and Foltz (1998) shows that it is a good estimator of textual coherence.</Paragraph>
<Paragraph position="4"> It is hoped that these latter two techniques (dimensionality reduction and the DS model) will provide a more robust representation of term contexts and therefore a better representation of sentence meaning, enabling us to achieve more reliable sentence similarity measurements for extractive summarisation.</Paragraph>
</Section> </Section>
<Section position="5" start_page="1" end_page="2" type="metho"> <SectionTitle> 3 SVD in Summarisation </SectionTitle>
<Paragraph position="0"> This section describes ways in which SVD has been used for summarisation and details the implementation in the Embra system.</Paragraph>
<Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.1 Related Work </SectionTitle>
<Paragraph position="0"> In seminal work by Gong and Liu (2001), the authors proposed that the rows of P^T may be regarded as defining topics, with the columns representing sentences from the document. In their SVD method, summarisation proceeds by choosing, for each row in P^T, the sentence with the highest value. This process continues until the desired summary length is reached.</Paragraph>
<Paragraph position="1"> Steinberger and Ježek (2004) have offered two criticisms of the Gong and Liu approach. Firstly, the method described above ties the dimensionality reduction to the desired summary length. Secondly, a sentence may score highly but never "win" in any dimension, and thus will not be extracted despite being a good candidate. Their solution is to assign each sentence an SVD-based score: score_i = sqrt( Σ_k v(i,k)^2 · s(k)^2 ), where v(i,k) is the kth element of the ith sentence vector and s(k) is the corresponding singular value.</Paragraph>
<Paragraph position="4"> Murray et al. (2005a) address the same concerns but retain the Gong and Liu framework. Rather than extracting the best sentence for each topic, the n best sentences are extracted, with n determined by the corresponding singular values from matrix S. Thus, dimensionality reduction is no longer tied to summary length and more than one sentence per topic can be chosen.</Paragraph>
<Paragraph position="5"> A similar approach at DUC 2005 using term co-occurrence models and SVD was presented by Jagarlamudi et al. (2005). Their system performs SVD over a term × sentence matrix and combines a relevance measurement based on this representation with one based on a term co-occurrence model via a weighted linear combination.</Paragraph>
</Section>
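As a concrete reading of the two selection strategies above, here is a minimal sketch over a toy term × sentence matrix; it is an illustration under assumed data, not the cited systems' code, and it requires only numpy.

    import numpy as np

    # Toy term x sentence matrix A (rows: terms, columns: sentences); values are illustrative.
    A = np.array([
        [1., 0., 2., 0.],
        [0., 1., 1., 0.],
        [1., 1., 0., 1.],
        [0., 0., 1., 2.],
    ])
    W, s, Pt = np.linalg.svd(A, full_matrices=False)  # A = W S P^T

    # Gong & Liu (2001): for each of the top rows of P^T (one 'topic' per row),
    # pick the sentence with the highest value until the summary length is reached.
    summary_length = 2
    gong_liu = [int(np.argmax(Pt[row])) for row in range(summary_length)]

    # Steinberger & Jezek (2004): score every sentence across all dimensions,
    # weighting each dimension by its singular value:
    #     score_i = sqrt( sum_k v(i,k)^2 * s(k)^2 )
    scores = np.sqrt(((Pt.T ** 2) * (s ** 2)).sum(axis=1))
    steinberger_jezek = [int(i) for i in np.argsort(-scores)[:summary_length]]

    print("Gong & Liu picks:          ", gong_liu)
    print("Steinberger & Jezek picks: ", steinberger_jezek)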
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Sentence Selection in Embra </SectionTitle>
<Paragraph position="0"> The Embra system developed for DUC 2005 attempts to derive more robust representations of sentences by building a large semantic space using SVD on a very large corpus. While researchers have used such large semantic spaces to aid in automatically judging the coherence of documents (Foltz et al., 1998; Barzilay and Lapata, 2005), to our knowledge this is a novel technique in summarisation.</Paragraph>
<Paragraph position="1"> Using a concatenation of AQUAINT and DUC 2005 data (100+ million words), we utilised the Infomap tool to build a semantic model based on singular value decomposition (SVD). The decomposition and projection of the matrix to a lower-dimensionality space results in a semantic model based on underlying term relations. In the current experiments, we set the dimensionality of the reduced representation to 100, a reduction of 90% from the full dimensionality of 1000 content-bearing terms in the original DS matrix; this value was found to perform better than 25, 50, 250 and 500 during parameter optimisation. A given sentence is represented as a vector which is the average of its constituent word vectors. This sentence representation is then fed into an MMR-style algorithm. MMR (Maximal Marginal Relevance) is a common approach for determining relevance and redundancy in multi-document summarisation, in which candidate sentences are represented as weighted term-frequency vectors; these can be compared to the query vector to gauge similarity, and to already-extracted sentence vectors to gauge redundancy, via the cosine of the vector pairs (Carbonell and Goldstein, 1998). While this has proved successful to a degree, the sentences are represented merely according to weighted term frequency in the document, and so two similar sentences stand a chance of not being considered similar if they do not share the same terms.</Paragraph>
Figure 1 (pseudocode, as recovered):
    for each sentence in document:
        for each word in sentence:
            get word vector from semantic model
        average word vectors to form sentence vector
<Paragraph position="3"> Our implementation of MMR (Figure 1) uses λ annealing following Murray et al. (2005a): λ decreases as the summary length increases, thereby emphasising relevance at the outset but increasingly prioritising redundancy removal as the process continues.</Paragraph>
</Section> </Section>
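The following is a minimal sketch of an MMR-style selection loop with λ annealing as just described; the particular annealing schedule, the toy vectors, and the word-budget handling are illustrative assumptions rather than the exact Embra procedure.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def mmr_select(sentence_vecs, query_vec, sentence_lengths, word_limit=250):
        """Greedy MMR with lambda annealing: lambda starts near 1 (favour relevance
        to the query) and decreases as the summary grows (favour penalising
        redundancy against already-extracted sentences)."""
        selected, words = [], 0
        remaining = list(range(len(sentence_vecs)))
        while remaining and words < word_limit:
            lam = 1.0 - words / float(word_limit)  # illustrative annealing schedule
            def score(i):
                relevance = cosine(sentence_vecs[i], query_vec)
                redundancy = max((cosine(sentence_vecs[i], sentence_vecs[j])
                                  for j in selected), default=0.0)
                return lam * relevance - (1.0 - lam) * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
            words += sentence_lengths[best]
        return selected

    # Toy usage: three sentence vectors in a 2-d semantic space, query along the first axis.
    vecs = np.array([[0.9, 0.1], [0.85, 0.2], [0.1, 0.9]])
    print(mmr_select(vecs, np.array([1.0, 0.0]), sentence_lengths=[120, 120, 60]))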
<Section position="6" start_page="2" end_page="3" type="metho"> <SectionTitle> 4 Experiment </SectionTitle>
<Paragraph position="0"> The experimental setup uses the DUC 2005 data (Dang, 2005) and the Rouge evaluation metric to explore the hypothesis that query-oriented multi-document summarisation using a term co-occurrence representation can be improved using SVD. We frame the research question as follows: does SVD dimensionality reduction lead to an increase in Rouge score compared to the DS representation?</Paragraph>
<Section position="1" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 4.1 Materials </SectionTitle>
<Paragraph position="0"> The DUC 2005 task was motivated by Amigó et al.'s (2004) suggestion of evaluations that model real-world complex question answering. The goal is to synthesise a well-organised, fluent answer of no more than 250 words to a complex question from a set of 25 to 50 relevant documents. The data includes a detailed query, a document set, and at least 4 human summaries for each of 50 topics.</Paragraph>
<Paragraph position="1"> The preprocessing was largely based on the LT TTT and LT XML tools (Grover et al., 2000; Thompson et al., 1997). First, we perform tokenisation and sentence identification, followed by lemmatisation.</Paragraph>
<Paragraph position="2"> At the core of the preprocessing is the LT TTT program fsgmatch, a general-purpose transducer which processes an input stream and adds annotations using rules provided in a hand-written grammar file. We also use the statistical combined part-of-speech (POS) tagger and sentence boundary disambiguation module from LT TTT (Mikheev, 1997). Using these tools, we produce an XML markup with sentence and word elements. Further linguistic markup is added using the morpha lemmatiser (Minnen et al., 2000) and the C&C named entity tagger (Curran and Clark, 2003) trained on the data from MUC-7.</Paragraph>
</Section>
<Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Methods </SectionTitle>
<Paragraph position="0"> The different system configurations (DS, DS+SVD, TF.IDF) were evaluated against the human upper bound and a baseline using Rouge-2 and Rouge-SU4. Rouge estimates the coverage of appropriate concepts (Lin and Hovy, 2003) in a summary by comparing it with several human-created reference summaries. Rouge-2 does so by computing precision and recall based on macro-averaged bigram overlap. Rouge-SU4 allows bigrams to be composed of non-contiguous words, with as many as four words intervening.</Paragraph>
<Paragraph position="1"> We use the same configuration as the official DUC 2005 evaluation (i.e. ROUGE-1.5.5.pl -n 2 -x -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 d), which is based on word stems (rather than full forms) and uses jackknifing (k−1 cross-evaluation) so that human gold-standard and automatic system summaries can be compared.</Paragraph>
<Paragraph position="2"> The independent variable in the experiment is the model of sentence semantics used by the sentence selection algorithm. We are primarily interested in the relative performance of the DS and DS+SVD representations. In addition, we include the DUC 2005 baseline, which is a lead summary created by taking the first 250 words of the most recent document for each topic. We also include a tf.idf-weighted term × sentence representation (TF.IDF) for comparison with a conventional MMR approach; specifically, we use tf_{i,j} · log(N/df_i) for term weighting, where tf_{i,j} is the number of times term i occurs in sentence j, N is the number of sentences, and df_i is the number of sentences containing term i. Finally, we include an upper bound calculated using the DUC 2005 human reference summaries. Preprocessing and all other aspects of the sentence selection algorithm remain constant over all systems.</Paragraph>
<Paragraph position="3"> In general, Rouge scores show a large variance across data sets (and so does system performance), so it is important to test whether observed nominal differences are due to chance or are statistically significant.</Paragraph>
<Paragraph position="4"> To test whether the Rouge metric showed a reliably different performance for the systems, the Friedman rank sum test (Friedman, 1940; Demšar, 2006) can be used. This is a hypothesis test not unlike an ANOVA; however, it is non-parametric, i.e. it does not assume a normal distribution of the measures (precision, recall and F-score). More importantly, it does not require homogeneity of variances.</Paragraph>
<Paragraph position="5"> To (partially) rank the systems against each other, we used a cascade of Wilcoxon signed-rank tests. These tests are again non-parametric (as they rank the differences between the system results for the datasets). As discussed by Demšar (2006), we used Holm's procedure for multiple tests to correct our error estimates (p).</Paragraph>
</Section>
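As an illustration of this testing pipeline (Friedman omnibus test, pairwise Wilcoxon signed-rank tests, Holm correction), here is a minimal sketch; the per-topic scores are randomly generated stand-ins, the Holm helper is a simple textbook implementation rather than the authors' scripts, and scipy is assumed to be available.

    from itertools import combinations
    import numpy as np
    from scipy.stats import friedmanchisquare, wilcoxon

    # Hypothetical per-topic Rouge-SU4 scores for 50 topics (rows) under four
    # system configurations; random data stands in for real system output.
    rng = np.random.default_rng(0)
    scores = {
        "baseline": rng.normal(0.08, 0.01, 50),
        "DS":       rng.normal(0.12, 0.01, 50),
        "DS+SVD":   rng.normal(0.13, 0.01, 50),
        "TF.IDF":   rng.normal(0.13, 0.01, 50),
    }

    # Omnibus test: Friedman rank sum test across systems, with topics as groups.
    stat, p = friedmanchisquare(*scores.values())
    print(f"Friedman chi^2 = {stat:.2f}, p = {p:.4g}")

    # Post-hoc: pairwise Wilcoxon signed-rank tests with Holm's step-down correction.
    pairs = list(combinations(scores, 2))
    pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]

    def holm(pvalues, alpha=0.05):
        """Holm's procedure: compare sorted p-values against alpha/(m-rank)."""
        order = np.argsort(pvalues)
        m = len(pvalues)
        rejected = [False] * m
        for rank, idx in enumerate(order):
            if pvalues[idx] > alpha / (m - rank):
                break
            rejected[idx] = True
        return rejected

    for (a, b), pv, rej in zip(pairs, pvals, holm(pvals)):
        print(f"{a} vs {b}: p = {pv:.4g}, reject at 0.05 (Holm): {rej}")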
<Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle>
<Paragraph position="0"> Friedman tests for each Rouge metric (with F-score, precision and recall included as observations and the dataset as group) showed a reliable effect of the system configuration.</Paragraph>
<Paragraph position="1"> Post-hoc analysis (Wilcoxon) showed (see Table 1) that all three systems performed reliably better than the baseline. TF.IDF performed better than simple DS in Rouge-2 and Rouge-SU4. DS+SVD performed better than DS (p_2 < 0.05, p_SU4 < 0.005). There is no evidence to support a claim that DS+SVD performed differently from TF.IDF.</Paragraph>
<Paragraph position="2"> However, when we specifically compared the performance of TF.IDF and DS+SVD on the Rouge-SU4 F-score for only the specific (as opposed to general) summaries, we found a reliable difference (Wilcoxon, p < 0.05). This result is unadjusted, and post-hoc comparisons with other scores or for the general summaries did not show reliable differences.</Paragraph>
<Paragraph position="3"> Having established the reliable performance improvement of DS+SVD over DS, it is important to take the effect size into consideration: with enough data, small effects may be statistically significant but practically unimportant. Figure 2 illustrates that the gain in mean performance is substantial. If the mean Rouge-SU4 score for human performance is seen as the upper bound, the DS+SVD system showed a 25.4 percent reduction in error compared to the DS system. A similar analysis for precision and recall gives qualitatively comparable results.</Paragraph>
</Section> </Section> </Paper>