<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1011"> <Title>Handling Figures in Document Summarization</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Figures as Surrogate Documents </SectionTitle> <Paragraph position="0"> Some time ago, when Lesk asked chemists what two parts of Chemistry papers would be most informative, they said they would like to know the names of the authors and to see the figures (Michael Lesk, personal communication).</Paragraph> <Paragraph position="1"> Recently, journals are beginning to implement approaches in this spirit. The Journal of Proteome Research lists in the table of contents, in both the print and online editions, an entry for each paper that includes the title, authors, abstract and one uncaptioned figure from the paper, typically in color. Science and Nature also include some figures in their contents pages. The new open-access journal, PLoS Biology, offers five &quot;Views&quot; of a paper: HTML, Tables, Figures, Print PDF and Screen PDF. The Figures View is an HTML slide show of the figures, each including a large version of the figure, the caption and the article citation.</Paragraph> <Paragraph position="2"> Figure Views represent a new and important type of summary of entire articles, allowing the rapid browsing that such visual displays provide.</Paragraph> <Paragraph position="3"> One can imagine that authors will adapt to this new mode, packing the major content of their papers into the figures and captions, reducing the need to read the full text.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Thumbnail Images are Summaries </SectionTitle> <Paragraph position="0"> Thumbnails are images that have been reduced in size and/or cropped to a smaller size. Shrinking an entire image so that it acts as a summary is an analog operation that has no parallel in text. 
For some images, shrinking them too much can produce an illegible result, a practice that has been roundly criticized (item 4 in Nielsen, 2003); cropped images may be useful in such cases.</Paragraph> <Paragraph position="1"> An example of cropping two very large images resulting in informative thumbnails appears in the Figure Gallery item on our site, http://diagrams.org/fig-pages/f00022.htm.</Paragraph> <Paragraph position="2"> The thumbnails are reproduced here in Figures 1 and 2.</Paragraph> <Paragraph position="3"> Figure 1: A full-scale analog extract (1% of the original) of the &quot;classic&quot; London Underground map. This is an informative summary with respect to the map style, but is only indicative of the full map.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Extraction for Summarization </SectionTitle> <Paragraph position="0"> One of the most important techniques used in text summarization is extraction, typically the extraction of carefully chosen whole sentences. A similar approach can be used for diagram summarization, but some thought needs to be given to what the sentence-like elements in diagrams might be. It is not difficult to give examples of diagram extraction, but automating it is by extraction. From (Holtzendorff, 2004). In this case, retention of one of the two bar graphs in A, one of the four rows in B and all of C would result in a modest, indicative summary of the three-part figure. The keys at grams that appear in the Biology research literature. 
The extraction suggested in our caption picks one item from each of two sets of similar items to produce an indicative summary.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Diagram-Related Text </SectionTitle> <Paragraph position="0"> It might be argued that the most salient content of documents with figures can be found in the text; that the figures are redundant, merely &quot;illustrative&quot;. This is often not the case. There are queries to documents that cannot be answered based on the content of the text or diagrams considered separately (Futrelle &amp; Rumshisky, 2001). In Biology it is not unusual for a caption to explain only the methods used to produce the data shown.</Paragraph> <Paragraph position="1"> The independent contribution of diagram content to a paper is often signaled by cue phrases.</Paragraph> <Paragraph position="2"> In referring to data graphs, phrases such as &quot;shows a significant difference&quot; or &quot;are similar&quot; or &quot;a pronounced effect&quot; require that the reader examine the data shown in the figure in order to understand what the phrases refer to.</Paragraph> <Paragraph position="3"> Fig. 4 (Nijhout, 2003) appeared in the popular scientific journal American Scientist and is more carefully explained than most. The Fig. 4 caption text illustrates some limitations of captions. For example, the phrase &quot;The possible combinations&quot; does not spell out what combinations are possible or are illustrated. The reader must study the figure to discover that there are in fact three bolding added, was: &quot;Enzyme activity is a function of allele identity. In this example, the allele A encodes an enzyme that has three times greater activity than the enzyme encoded by allele a. The possible combinations of A and a in an individual yield a wide range of overall activity levels.&quot; The references to A and a in Fig. 
4 are deictic references, pointing to objects visible in the context, here the figure. In ordinary conversation, such a reference would point to some physical object in the view of the listener.</Paragraph> <Paragraph position="4"> A summarization of Figure 4 should include the entire diagram. The last sentence of the caption would be a suitable summary of the caption.</Paragraph> <Paragraph position="5"> The non-caption text and the text within figures play important roles and need to be taken into account in any attempt to produce a summary. Space precludes further discussion of these.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Prospects for Automation </SectionTitle> <Paragraph position="0"> Some degree of summarization might be possible based entirely on the classes of the diagrams or subdiagrams in a paper. We have been able to locate subdiagrams in vector-based diagrams in PDFs and successfully classify them using Support Vector Machines (Futrelle, Shao, Cieslik, &amp; Grimes, 2003).</Paragraph> <Paragraph position="1"> But any more detailed summarization decisions would require parsed representations of the diagrams. For example, our parser can discover and analyze the two bar charts in Fig. 3, allowing a system to extract only one of them, though without any knowledge as to which is the more salient.</Paragraph> <Paragraph position="2"> The parser can also locate keys, such as the ones in Fig. 3, so they can also be extracted. Standard strategies from text summarization, such as extracting the diagrams most often referred to, diagrams appearing near the beginning and end of a paper, etc., are all possible. Clearly, automation of diagram summarization presents a new set of challenges and is no easier than text summarization. Large-scale evaluation of diagram summarization will offer its own challenges, cf. 
text summarization evaluation (Radev et al., 2003).</Paragraph> </Section> </Paper>