<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1110"> <Title>Automatic summarization of search engine hit lists</Title> <Section position="4" start_page="100" end_page="100" type="metho"> <SectionTitle> 3 Clustering </SectionTitle> <Paragraph position="0"> Our system uses two types of clustered input: either the set of hits that the user has selected or the output of our own clustering engine -</Paragraph> </Section> <Section position="5" start_page="100" end_page="100" type="metho"> <SectionTitle> CIDR (Columbia Intelligent Document </SectionTitle> <Paragraph position="0"> Relater). CIDR is described in (Radev et al., 1999). It uses an iterative algorithm that creates so-called &quot;document centroids&quot; as a side product. The centroids contain the words most highly relevant to the entire cluster (not to the user query). We use these words to find the most salient &quot;themes&quot; in the cluster of documents.</Paragraph> <Section position="1" start_page="100" end_page="100" type="sub_section"> <SectionTitle> 3.1 Finding themes within clusters </SectionTitle> <Paragraph position="0"> One of the underlying assumptions behind SNS is that when a user selects a set of hits after reading the single-document summaries from the hit list retrieved by the system, he or she performs a cognitive activity of selecting documents that appear to be related to one or more common themes. The multi-document summarization algorithm attempts to identify these themes and the most salient passages from the selected documents, using a pseudo-document called the cluster centroid, which is computed automatically from the entire list of hits selected by the user.</Paragraph> </Section> <Section position="2" start_page="100" end_page="100" type="sub_section"> <SectionTitle> 3.2 Computing centroids </SectionTitle> <Paragraph position="0"> Figure 2 shows a sample cluster centroid.
The TF column indicates the average term frequency of a given term within the cluster. E.g., a TF value of 13.33 over three documents indicates that the term &quot;deny&quot; appears a total of 40 times in the three documents. The IDF values are computed from a mixture of</Paragraph> </Section> </Section> <Section position="6" start_page="100" end_page="101" type="metho"> <SectionTitle> 4 Centroid-based summarization </SectionTitle> <Paragraph position="0"> The main technique that we use for summarization is sentence extraction. We score each sentence within a cluster individually and output those that score the highest. A more detailed description of the summarizer can be found in (Radev et al., 2000).</Paragraph> <Paragraph position="1"> The input to the summarization component is a cluster of documents. These documents can be either the result of a user query or the output of CIDR.</Paragraph> <Paragraph position="2"> The summarizer takes as input a cluster of documents with a total of n sentences as well as a compression ratio parameter r which indicates how much of the original cluster to preserve.</Paragraph> <Paragraph position="3"> The output consists of a sequence of ⌈n * r⌉ sentences from the original documents in the same order as the input documents. The highest-ranking sentences are included according to the scoring formula below:</Paragraph> <Paragraph position="4"> score(Si) = wc * Ci + wp * Pi + wf * Fi </Paragraph> <Paragraph position="5"> In the formula, wc, wp, and wf are weights. Ci is the centroid score of the sentence, Pi is the positional score of the sentence, and Fi is the score of the sentence according to its overlap with the first sentence of the document.</Paragraph> <Section position="1" start_page="101" end_page="101" type="sub_section"> <SectionTitle> 4.1 Centroid value </SectionTitle> <Paragraph position="0"> The centroid value Ci for sentence Si is computed as the sum of the centroid values Cw of all words in the sentence.
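The centroid value just defined can be sketched in a few lines of Python; this is our illustration, not the system's code, and the word scores below are invented (only clinton = 36.39 comes from the example in this section).

```python
# Illustrative sketch of the centroid value Ci from Section 4.1:
# a sentence's score is the sum of the centroid values of its words.
# The centroid dictionary is hypothetical, not the one in Figure 2.
def centroid_value(sentence, centroid):
    """Sum the centroid scores of every word occurrence in the sentence."""
    return sum(centroid.get(word.lower(), 0.0) for word in sentence.split())

centroid = {"clinton": 36.39, "vernon": 30.51, "jordan": 28.44}
print(round(centroid_value("President Clinton met with Vernon Jordan", centroid), 2))
# prints 95.34 (36.39 + 30.51 + 28.44)
```

Words absent from the centroid contribute nothing, so frequent cluster-specific terms dominate the score.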
For example, the sentence &quot;President Clinton met with Vernon Jordan in January&quot; gets a score of 243.34, which is the sum of the individual centroid values of its words (clinton = 36.39; vernon =</Paragraph> <Paragraph position="2"/> </Section> <Section position="2" start_page="101" end_page="101" type="sub_section"> <SectionTitle> 4.2 Positional value </SectionTitle> <Paragraph position="0"> The positional value is computed as follows: the first sentence in a document gets the same score Cmax as the highest-ranking sentence in the document according to the centroid value.</Paragraph> <Paragraph position="1"> The score for all sentences within a document is computed according to the following formula:</Paragraph> <Paragraph position="2"> Pi = (n - i + 1) / n * Cmax </Paragraph> <Paragraph position="3"> For example, if the sentence described above appears as the third sentence out of 30 in a document and the largest centroid value of any sentence in the given document is 917.31, the positional value will be P3 = 28/30 * 917.31 = 856.16.</Paragraph> </Section> <Section position="3" start_page="101" end_page="101" type="sub_section"> <SectionTitle> 4.3 First-sentence overlap </SectionTitle> <Paragraph position="0"> The overlap value is computed as the inner product of the sentence vectors for the current sentence i and the first sentence of the document. The sentence vectors are the n-dimensional representations of the words in each sentence, whereby the value at a given position of a sentence vector indicates the number of occurrences of the corresponding word in the sentence.</Paragraph> <Paragraph position="2"/> </Section> <Section position="4" start_page="101" end_page="101" type="sub_section"> <SectionTitle> 4.4 Combining the three parameters </SectionTitle> <Paragraph position="0"> As indicated in (Radev et al., 2000), we have experimented with several weighting schemes for the three parameters (centroid, position, and first-sentence overlap).
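The three per-sentence scores and their weighted combination can be sketched as follows; all function and variable names are ours, and the equal default weights mirror the setting reported later in this section.

```python
# Sketch of the sentence scores from Sections 4.1-4.3 and their combination.
def positional_value(i, n, c_max):
    """Pi = (n - i + 1) / n * Cmax; the first sentence (i = 1) gets Cmax."""
    return (n - i + 1) / n * c_max

def overlap_value(sentence, first_sentence):
    """Inner product of the word-count vectors of two tokenized sentences."""
    counts = {}
    for w in first_sentence:
        counts[w] = counts.get(w, 0) + 1
    return sum(counts.get(w, 0) for w in sentence)

def sentence_score(c_i, p_i, f_i, wc=1.0, wp=1.0, wf=1.0):
    """score(Si) = wc * Ci + wp * Pi + wf * Fi."""
    return wc * c_i + wp * p_i + wf * f_i

# The worked example from Section 4.2: third sentence of 30, Cmax = 917.31.
print(round(positional_value(3, 30, 917.31), 2))  # prints 856.16
```

Note that the positional score decays linearly with sentence position, so later sentences need a high centroid or overlap score to be extracted.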
To date, the three weights wc, wp, and wf are neither automatically learned nor derived from a user profile. Instead, we have experimented with various sets of empirically determined values for the weights. In this paper, the results are based on equal weights for the three parameters: wc = wp = wf = 1.</Paragraph> </Section> </Section> <Section position="7" start_page="101" end_page="105" type="metho"> <SectionTitle> 5 User Interface </SectionTitle> <Paragraph position="0"> We describe in this section the user interface for the web search mode described earlier in Section 1.</Paragraph> <Paragraph position="1"> One component of our system is the search engine (MySearch). The detailed design of the search component is discussed in Section 2. The result of a sample query &quot;Clinton&quot; to our search engine is shown starting in Figure 4.</Paragraph> <Paragraph position="2"> A user has the option to choose a specific ranking function as well as the number of retrieval results to be shown on a single screen. The keywords contained in the query string are automatically highlighted in the search results to provide contextual information for the user.</Paragraph> <Paragraph position="3"> The overall interface for SNS is built around the MySearch search engine. When a user submits a query, the screen in Figure 5 appears. As can be seen from Figure 5, there is a check box along with each retrieved record. This allows the user to tell the summarization engine which documents he or she wants to summarize. After the user clicks the summarization button, the summarization option screen is displayed, as shown at the bottom of Figure 6. The summarization option screen allows a user to specify the summarization compression ratio.
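A minimal sketch of how the compression ratio maps to summary length, assuming (as Section 4 states) that the top ⌈n * r⌉ sentences are kept and re-emitted in document order; the names and scores below are invented for illustration.

```python
import math

# Sketch: keep the ceil(n * r) highest-scoring sentences, then restore
# the original document order. Names and scores here are our own.
def extract_summary(scored_sentences, r):
    """scored_sentences: list of (position, score) pairs; r: compression ratio."""
    k = math.ceil(len(scored_sentences) * r)
    top = sorted(scored_sentences, key=lambda s: s[1], reverse=True)[:k]
    return sorted(pos for pos, _ in top)

scores = [(1, 5.0), (2, 9.0), (3, 1.0), (4, 7.0), (5, 3.0), (6, 8.0)]
print(extract_summary(scores, 0.3))  # prints [2, 6]: ceil(6 * 0.3) = 2 sentences
```

Re-sorting by position at the end is what preserves the original reading order of the extracted sentences.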
Figure 7 shows the summarization result for four URLs with the compression ratio set to 30%.</Paragraph> <Paragraph position="4"> The following information is shown in the summarization result screen in Figure 7: * The number of sentences in the text of the set of URLs that the user selected * The number of sentences in the summary * The sentences representing the themes of those selected URLs, together with their relative scores. The sentences are ordered the same way they appear in the original set of documents.</Paragraph> </Section> </Paper>