<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2611">
  <Title>Abstraction Summarization for Managing the Biomedical Research Literature</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Multidocument Summarization
</SectionTitle>
    <Paragraph position="0"> The MEDLINE database, developed and maintained by the National Library of Medicine, contains more than 12 million citations (dating from the 1960's to the present) drawn from nearly 4,600 journals in the biomedical domain. Access is provided by a statistical information retrieval system. Due to the size of the database, searches often retrieve large numbers of items. For example, the query &amp;quot;diabetes&amp;quot; returns 207,997 citations. Although users can restrict searches by language, date and publication type (as well as specific journals), results can still be large. For example, a query for treatment (only) for diabetes, limited to articles published in 2003 and having an abstract in English finds 3,621 items; limiting this further to articles describing clinical trials still returns 390 citations. We describe the adaptation of our abstraction summarization process to multi-document input for managing the results of searches in MEDLINE.</Paragraph>
    <Paragraph position="1"> Extending summarization to multidocument input presents challenges in removing redundancies across documents while at the same time retaining differences that might be important. One issue is devising a framework on which to compute similarities and differences across documents. Radev (2000) defines twenty-four relationships (such as equivalence, subsumption, and contradiction) that might apply at various structural levels across documents. Sub-events (Daniel et al., 2003) and sub-topics (Saggion and Lapalme, 2002) also contribute to the framework used for comparing documents in multidocument summarization.</Paragraph>
    <Paragraph position="2"> A particular challenge to multidocument summarization in the extraction paradigm is determining what parts of documents conform to the framework for defining similarities and differences. A recent study (Kan et al., 2001) uses topic composition from text headers, but other studies in the extraction paradigm (Goldstein et al., 1999), extraction coupled with rhetorical structural identification (Teufel and Moens, 2002), and syntactic abstraction paradigms use different methodologies (Barzilay et al., 1999; McKeown et al., 1999). Our semantic abstraction summarization system naturally extends to multidocument input with no modification from the system designed for single documents. The disorder schema serves as the framework for identifying sub-topics, and predications retrieved across several documents must conform to its structure.</Paragraph>
    <Paragraph position="3"> Informational equivalence (and redundancy) is computed on this basis. For example, all predications that conform to the schema line {Treatment} TREATS {Disorders} constitute a representation of a subtopic in the disorder domain. Exact matches in this set constitute redundant information, and other types of relationships can be computed on the basis of partial matches. Although we concentrate on similarities across documents, differences could be computed by examining predications that are not shared among citations.</Paragraph>
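The computation of informational equivalence described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `SCHEMA_GROUPS` table, the function names, and the toy predication triples are all assumptions introduced here.

```python
from collections import defaultdict

# Illustrative mapping from concepts to semantic groups of the disorder
# schema; the real system derives these groups from the UMLS.
SCHEMA_GROUPS = {
    "Sumatriptan": "Treatment",
    "Botulinum Toxins": "Treatment",
    "Migraine": "Disorders",
}

def schema_line(pred):
    """Map a predication to its schema line, e.g. {Treatment} TREATS {Disorders}."""
    subj, rel, obj = pred
    return (SCHEMA_GROUPS.get(subj), rel, SCHEMA_GROUPS.get(obj))

def subtopics(predications):
    """Group predications by schema line; each group represents a sub-topic.
    Exact duplicates within a group are redundant information."""
    groups = defaultdict(list)
    for pred in predications:
        groups[schema_line(pred)].append(pred)
    return groups

preds = [
    ("Sumatriptan", "TREATS", "Migraine"),
    ("Sumatriptan", "TREATS", "Migraine"),    # exact match: redundant
    ("Botulinum Toxins", "TREATS", "Migraine"),
]
by_line = subtopics(preds)
treats = by_line[("Treatment", "TREATS", "Disorders")]
unique = set(treats)  # collapsing exact matches removes redundancy
```

Partial matches within a group (same schema line, different arguments) would then be the basis for computing other cross-document relationships.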
    <Paragraph position="4"> We have begun testing our system applied to the results of MEDLINE searches on disorders, concentrating on the most recent 300 citations retrieved. The results for migraine are represented graphically in Figure 3.</Paragraph>
    <Paragraph position="5"> Traversing the predicates (arcs) in this condensate provides an informative summary of these citations.</Paragraph>
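Traversal of a condensate's arcs can be sketched as a walk over a labeled graph. The adjacency-list structure, the node and arc contents, and the function name below are all invented for illustration.

```python
# A condensate viewed as a small graph: concepts are nodes and predicates
# label the arcs. The contents here are toy values, not the migraine data.
condensate = {
    "Sumatriptan": [("TREATS", "Migraine")],
    "Migraine": [("ISA", "Headache Disorders")],
}

def traverse(graph):
    """Enumerate the arcs of the condensate as readable predications."""
    return [f"{subj} {rel} {obj}"
            for subj, arcs in graph.items()
            for rel, obj in arcs]

summary_lines = traverse(condensate)
```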
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation and Results
</SectionTitle>
    <Paragraph position="0"> Evaluation in automatic summarization, especially with multidocument input, is daunting (Radev, 2000). Evaluation can be intrinsic (judging the quality of the summary as related to the source documents) or extrinsic (how the summary affects some other task). Since we do not have a gold standard to compare the final condensates against, we performed a linguistic evaluation on the quality of the condensates generated for four diseases: migraine, angina pectoris, Crohn's disease, and pneumonia. The input for each summary was 300 MEDLINE citations.</Paragraph>
    <Paragraph position="1"> Table 1 presents evaluation results. The first author (MF) examined the source sentence that SemRep used to generate each predication and marked predications as either correct or incorrect. Precision was calculated as the total number of correct predications divided by the total number of predications in the condensate. We also measured the reduction (compression) for each of the four disorder concepts. In Table 1, &amp;quot;Base&amp;quot; is the number of predications SemRep produced from each set of 300 citations. &amp;quot;Final&amp;quot; is the number of predications left after the final transformation. Therefore, this is a compression ratio on the semantic space of predications, and is different from text compression in the traditional sense.</Paragraph>
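The two measures can be stated directly. A minimal sketch, assuming our own function names; the sample inputs reuse only the migraine figures quoted later in the paper (2,485 predications reduced to 311 after phases 1-3), not Table 1's actual Base/Final values.

```python
def precision(correct, total):
    """Correct predications divided by all predications in the condensate."""
    return correct / total

def compression_ratio(base, final):
    """'Final' predications remaining over 'Base' predications produced by
    SemRep -- compression over the semantic space of predications,
    not text compression in the traditional sense."""
    return final / base

# Sample inputs: the migraine counts reported in the next section.
ratio = compression_ratio(2485, 311)
```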
    <Paragraph position="3"> [Table 1: Results for our search concepts. C = Correct, I = Incorrect] In Crohn's disease (with lowest precision) a single SemRep error type in argument identification accounts for 52% of the mistakes. For example, in processing the sentence 36 patients with inflammatory bowel disease (11 with ulcerative colitis and 25 with Crohn's disease), the parenthesized material caused SemRep to incorrectly return &amp;quot;Inflammatory Bowel Diseases CO-OCCURS_WITH Ulcerative Colitis&amp;quot; and &amp;quot;Ulcerative Colitis CO-OCCURS_WITH Crohn's Disease.&amp;quot; Word sense ambiguity also contributed to a large number of errors.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Content Characterization
</SectionTitle>
    <Paragraph position="0"> We examined the effect that the transformation stage has on the distribution of predicates and predications during the summarization process. SemRep produced 2,485 predications from 300 citations retrieved for migraine. Of these, 1,638 are distributed over six predicates in the disorder schema (327-TREATS; 148-ISA; 180-LOCATION_OF; 54-CAUSES; 720-OCCURS_IN; and 209-CO-OCCURS_WITH). After phases 1, 2, and 3 of the transformation process, 311 predications remain (134-TREATS; 41-ISA; 12-LOCATION_OF; 5-CAUSES; 68-OCCURS_IN; and 51-CO-OCCURS_WITH). This reduction is largely due to hierarchical pruning in phase 3. Phase 4 operations, based on frequency of occurrence pruning (saliency), further condensed the list, and the top three TREATS predication types in the final condensate are (13-Sumatriptan TREATS Migraine; 6-Botulinum Toxins TREATS Migraine; and 6-feverfew extract TREATS Migraine). This list represents the fact that Sumatriptan is a popular treatment for migraine. Besides frequency, another way of looking at the predications is typicality (Kan et al., 2001), or distribution of predications across citations. Looking at the final condensate for migraine and focusing on TREATS, the most widely distributed predications are &amp;quot;Sumatriptan TREATS Migraine,&amp;quot; which occurs in ten citations; &amp;quot;Botulinum Toxins TREATS Migraine&amp;quot; (three citations); and &amp;quot;feverfew extract TREATS Migraine&amp;quot; (two citations).</Paragraph>
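The distinction between frequency (saliency) and typicality can be sketched as follows. The citation identifiers and counts below are toy values invented for illustration, not the migraine data, and the function names are ours.

```python
from collections import Counter
from itertools import chain

# Toy occurrence data: citation id -> predications found in that citation.
citations = {
    "c1": ["Sumatriptan TREATS Migraine", "Sumatriptan TREATS Migraine"],
    "c2": ["Sumatriptan TREATS Migraine"],
    "c3": ["Botulinum Toxins TREATS Migraine"],
}

def saliency(citations):
    """Overall frequency of each predication (basis for phase-4 pruning)."""
    return Counter(chain.from_iterable(citations.values()))

def typicality(citations):
    """Number of distinct citations in which each predication appears
    (in the sense of Kan et al., 2001)."""
    spread = Counter()
    for preds in citations.values():
        for pred in set(preds):
            spread[pred] += 1
    return spread

freq = saliency(citations)      # counts every occurrence
spread = typicality(citations)  # counts each citation at most once
```

A predication repeated many times within one citation scores high on saliency but not on typicality, which is why the two rankings can differ.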
    <Paragraph position="1"> One can also view the final condensate from the perspective of citations, rather than predications. Of the 300 citations initially parsed, only 63 are represented in the final condensate: one with six predications, one with five predications, three with four predications, and so on. It is tempting to hypothesize that more highly relevant citations will have produced more predications, but this must be formally tested in the context of the user's retrieval objective.</Paragraph>
    <Paragraph position="2"> An informal examination of the citations that contributed to the final condensate for migraine revealed differences that we so far do not accommodate. Some of these, such as publication and study type, could be addressed outside of natural language processing with MEDLINE metadata. Others, including medication delivery system and target population of the disorder topic, are amenable to current processing either through extension of the disease schema or enhancements to SemRep.</Paragraph>
  </Section>
</Paper>