<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2611">
  <Title>Abstraction Summarization for Managing the Biomedical Research Literature</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Several approaches to text-based information management applications are being pursued, including word-based statistical processing and those depending on string matching, syntax, or semantics. Statistical systems have enjoyed considerable success for information retrieval, especially using the vector space model (Salton et al., 1975). Since the SIR system (Raphael, 1968), some have felt that automatic information management could best be addressed using semantic information.</Paragraph>
    <Paragraph position="1"> Subsequent research (Schank, 1975; Wilks, 1976) expanded this paradigm. More recently, a number of knowledge-based applications have shown considerable promise. These include systems for machine translation (Viegas et al., 1998), question answering (Harabagiu et al., 2001; Clark et al., 2003), and information retrieval (Mihalcea and Moldovan, 2000).</Paragraph>
    <Paragraph position="2"> In the biomedical domain, the MEDLINE® bibliographic database provides opportunities for keeping abreast of the research literature. However, the large size of this online resource presents potential challenges to the user. Query results often include hundreds or thousands of citations (including title and abstract).</Paragraph>
    <Paragraph position="3"> Automatic summarization offers potential help in managing such results; however, the most popular approach, extraction, faces challenges when applied to multi-document summarization (McKeown et al., 2001).</Paragraph>
    <Paragraph position="4"> Abstraction summarization offers an attractive alternative for managing citations resulting from MEDLINE searches. We present a knowledge-rich abstraction approach that depends on underspecified semantic interpretation of biomedical text. As an example, a graphical representation (Batagelj, 2003) of the semantic predications serving as a summary (or conceptual condensate) from our system is shown in Figure 1. The input text was a MEDLINE citation with title &quot;Gastrointestinal tolerability and effectiveness of rofecoxib versus naproxen in the treatment of osteoarthritis: a randomized, controlled trial.&quot; Our semantic interpreter and the abstraction summarizer based on it both draw on semantic information from the Unified Medical Language System® (UMLS®), a resource for structured knowledge in the biomedical domain. After introducing the semantic interpreter, we describe the transformation phase of our paradigm, discussing principles that depend on semantic notions in order to condense the semantic predications representing the content of text. Initially, this process was applied to summarizing single documents. We discuss its adaptation to multidocument input, specifically to the set of citations resulting from a query to the MEDLINE database. Although we have not yet formally evaluated the effectiveness of the resulting condensate, we discuss its characteristics and possibilities as both an indicative and informative summary.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2
Background
Lexical Semantics
</SectionTitle>
      <Paragraph position="0"> Research in lexical semantics (Cruse, 1986) provides insight into the interaction of reference and linguistic structure. In addition to paradigmatic lexical phenomena such as synonymy, hypernymy, and meronymy, diathesis alternation (Levin and Rappaport Hovav, 1996), deep case (Fillmore, 1968), and the interaction of predicational structure and events (Tenny and Pustejovsky, 2000) are being investigated. Some of the consequences of research in lexical semantics, with particular attention to natural language processing, are discussed by Pustejovsky et al. (1993) and Nirenburg and Raskin (1996).</Paragraph>
      <Paragraph position="1"> Implemented systems often draw on the information contained in WordNet (Fellbaum, 1998).</Paragraph>
      <Paragraph position="2"> In the biomedical domain, UMLS knowledge provides considerable support for text-based systems.</Paragraph>
      <Paragraph position="3"> (Burgun and Bodenreider (2001) compare the UMLS to WordNet.) The UMLS (Humphreys et al., 1998) consists of three components: the Metathesaurus®, Semantic Network (McCray, 1993), and SPECIALIST Lexicon (McCray et al., 1994). The Metathesaurus is at the core and contains more than 900,000 concepts compiled from more than sixty controlled vocabularies. Many of these have hierarchical structure, and some contain meronymic information in addition to hypernymy. Editors combine terms in the constituent vocabularies into a set of synonyms (cf. WordNet's synsets), which constitutes a concept. One term in this set is called the &quot;preferred name&quot; and is used as the concept name, as shown in (1).</Paragraph>
      <Paragraph position="4">  (1) Concept: Dyspnea Synonyms: Breathlessness, Shortness of breath, Breathless, Difficulty breathing, Respiration difficulty, etc.</Paragraph>
      <Paragraph position="5">  In addition, each concept in the Metathesaurus is assigned at least one semantic type (such as 'Sign or Symptom' for (1)), which categorizes the concept in the biomedical domain. The semantic types available are drawn from the Semantic Network, in which they are organized hierarchically in two single-inheritance trees, one under the root 'Entity' and another under 'Event'. The Semantic Network also contains semantic predications with semantic types as arguments. The predicates are semantic relations relevant to the biomedical domain and are organized as subtypes of five classes, such as TEMPORALLY_RELATED_TO and FUNCTIONALLY_RELATED_TO. Examples are shown in (2).</Paragraph>
      <Paragraph position="6"> (2) 'Pharmacologic Substance' TREATS 'Disease or Syndrome', 'Virus' CAUSES 'Disease or Syndrome'. Lexical semantic information in the UMLS is distributed between the Metathesaurus and the Semantic Network. The Semantic Network stipulates permissible argument categories for classes of semantic predications, although it does not refer to deep case relations. The Metathesaurus encodes synonymy, hypernymy, and meronymy (especially for human anatomy). Synonymy is represented by including synonymous terms under a single concept. Word sense ambiguity is represented to some extent in the Metathesaurus. For example, discharge is represented by the two concepts in (3), with different semantic types.</Paragraph>
      <Paragraph position="7">  (3) Discharge, Body Substance: 'Body Substance'; Patient Discharge: 'Health Care Activity'. The SPECIALIST Lexicon contains orthographic information (such as spelling variants) and syntactic information, including inflections for nouns and verbs and subcategorization for verbs. A suite of lexical access tools accommodates other phenomena, including derivational variation.</Paragraph>
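The Metathesaurus organization described above (a concept as a set of synonyms with a preferred name and at least one semantic type) can be pictured with a small sketch. This is only an illustration in Python, not UMLS code: the class name Concept, the table CONCEPTS, and the function lookup are hypothetical, and the synonym lists given for the two "discharge" concepts are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    preferred_name: str    # the term used as the concept name
    synonyms: tuple        # other terms grouped under the same concept
    semantic_types: tuple  # at least one Semantic Network type


# Toy entries taken from examples (1) and (3); the synonym lists for the
# two "discharge" concepts are assumed for illustration only.
CONCEPTS = (
    Concept("Dyspnea",
            ("Breathlessness", "Shortness of breath", "Breathless",
             "Difficulty breathing", "Respiration difficulty"),
            ("Sign or Symptom",)),
    Concept("Discharge, Body Substance", ("discharge",), ("Body Substance",)),
    Concept("Patient Discharge", ("discharge",), ("Health Care Activity",)),
)


def lookup(term):
    """Return every concept whose preferred name or synonyms match the term."""
    key = term.lower()
    return [c for c in CONCEPTS
            if key == c.preferred_name.lower()
            or key in (s.lower() for s in c.synonyms)]
```

In this sketch, looking up "shortness of breath" yields the single concept Dyspnea, while looking up "discharge" yields two concepts with different semantic types, which is how word sense ambiguity surfaces.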
      <Paragraph position="8"> SemRep. Our summarization system relies on semantic predications provided by SemRep (Rindflesch and Fiszman, 2003), a program that draws on UMLS information to provide underspecified semantic interpretation in the biomedical domain (Srinivasan and Rindflesch, 2002; Rindflesch et al., 2000). Semantic interpretation is based on a categorical analysis that is underspecified in that it is a partial parse (cf. McDonald, 1992). This analysis depends on the SPECIALIST Lexicon and the Xerox part-of-speech tagger (Cutting et al., 1992) and provides simple noun phrases that are mapped to concepts in the UMLS Metathesaurus using MetaMap (Aronson, 2001).</Paragraph>
      <Paragraph position="9"> The categorial analysis enhanced with Metathesaurus concepts and associated semantic types provides the basis for semantic interpretation, which relies on two components: a set of &quot;indicator&quot; rules and an (underspecified) dependency grammar. Indicator rules map between syntactic phenomena (such as verbs, nominalizations, and prepositions) and predicates in the Semantic Network. For example, such rules stipulate that the preposition for indicates the semantic predicate TREATS in sumatriptan for migraine. The application of an indicator rule satisfies the first of several necessary conditions for the interpretation of a semantic predication. Argument identification is controlled by a partial dependency grammar. As is common in such grammars, a general principle disallows intercalated dependencies (crossing lines). Further, a noun phrase may not be used as an argument in the interpretation of more than one semantic predication without license. (Coordination and relativization license noun phrase reuse.) A final principle states that if a rule can apply, it must apply. Semantic interpretation in SemRep is not based on the &quot;real&quot; syntactic structure of the sentence; however, the linear order of the components of the partial parse is crucial. Argument identification rules are articulated for each indicator in terms of surface subject and object. For example, subjects of verbs are to the left and objects are to the right. (Passivization is accommodated before final interpretation.) There are also rules for prepositions and several rules for arguments of nominalizations. The final condition on the interpretation of an associative semantic predication is that it must conform to the appropriate relationship in the Semantic Network.</Paragraph>
      <Paragraph position="10"> For example, if a predication is being constructed on the basis of an indicator rule for TREATS, the syntactic arguments identified by the dependency grammar must have been mapped to Metathesaurus concepts with semantic types that conform to the semantic arguments of TREATS in the Semantic Network, such as 'Pharmacologic Substance' and 'Disease or Syndrome'. Hypernymic propositions are further controlled by hierarchical information in the Metathesaurus (Rindflesch and Fiszman, 2003).</Paragraph>
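The two conditions just described (an indicator rule fires, and the argument types conform to the Semantic Network) can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not SemRep's implementation: the tables INDICATOR_RULES, SEMANTIC_NETWORK, and SEMANTIC_TYPES and the function interpret are hypothetical names, the semantic types assigned to "sumatriptan" and "migraine" are assumed, and argument identification by the dependency grammar is taken as already done.

```python
# Semantic Network relationships from example (2): (type, predicate, type).
SEMANTIC_NETWORK = {
    ("Pharmacologic Substance", "TREATS", "Disease or Syndrome"),
    ("Virus", "CAUSES", "Disease or Syndrome"),
}

# Hypothetical indicator rule: the preposition "for" indicates TREATS.
INDICATOR_RULES = {("preposition", "for"): "TREATS"}

# Assumed semantic types for the noun phrases in "sumatriptan for migraine".
SEMANTIC_TYPES = {
    "sumatriptan": {"Pharmacologic Substance"},
    "migraine": {"Disease or Syndrome"},
}


def interpret(subject, indicator, obj):
    """Return (subject, predicate, obj) if both conditions hold, else None."""
    predicate = INDICATOR_RULES.get(indicator)
    if predicate is None:
        return None  # no indicator rule applies
    for s_type in SEMANTIC_TYPES.get(subject, ()):
        for o_type in SEMANTIC_TYPES.get(obj, ()):
            if (s_type, predicate, o_type) in SEMANTIC_NETWORK:
                return (subject, predicate, obj)
    return None  # argument types do not conform to the Semantic Network
```

With these toy tables, the phrase "sumatriptan for migraine" yields a TREATS predication, whereas swapping the two arguments yields nothing because the types no longer match the Semantic Network relationship.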
      <Paragraph position="11"> In processing the sentence in (4), SemRep first constructs the partial categorical representation given schematically in (5). This is enhanced with semantic information from the Metathesaurus as shown in (6), where the corresponding concept for each relevant noun phrase is shown, along with its semantic type. The final semantic interpretation for (4) is given in (7).</Paragraph>
      <Paragraph position="12">  (4) Mycoplasma pneumonia is an infection of the lung caused by Mycoplasma pneumoniae (5) [[Mycoplasma pneumonia] [is] [an infection] [of the lung] [caused] [by Mycoplasma pneumoniae]] (6) &quot;Mycoplasma pneumonia&quot;-'Disease or Syn-</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Automatic Summarization
</SectionTitle>
      <Paragraph position="0"> Automatic summarization is &quot;a reductive transformation of source text to summary text through content reduction, selection, and/or generalization on what is important in the source&quot; (Sparck Jones, 1999). Two paradigms are being pursued: extraction and abstraction (Hahn and Mani, 2000). Extraction concentrates on creating a summary from the actual text occurring in the source document, relying on notions such as frequency of occurrence and cue phrases to identify important information. Abstraction, on the other hand, relies either on linguistic processing followed by structural compaction (Mani et al., 1999) or on interpretation of the source text into a semantic representation, which is then condensed to retain only the most important information asserted in the source. The semantic abstraction paradigm is attractive due to its ability to manipulate information that may not have been explicitly articulated in the source document. However, due to the challenges in providing semantic representation, semantic abstraction has not been widely pursued, although the TOPIC system (Hahn and Reimer, 1999) is a notable exception.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Semantic Abstraction Summarization
</SectionTitle>
      <Paragraph position="0"> We are devising an approach to automatic summarization in the semantic abstraction paradigm, relying on SemRep for semantic interpretation of source text. The transformation stage that condenses these predications is guided by principles articulated in terms of frequency of occurrence as well as lexical semantic phenomena.</Paragraph>
      <Paragraph position="1"> We do not produce a textual summary; instead, we present the disorder condensates in graphical format.</Paragraph>
      <Paragraph position="2"> We first discuss the application of this approach to summarizing single documents (full-text research articles on treatment of disease) and then consider its extension to multidocument input in the form of biomedical scientific abstracts directed at clinical researchers. The transformation stage takes as input a list of SemRep predications and a seed disorder concept. The output is a conceptual condensate for the input concept. Before transformation begins, predications are subjected to a focused word sense disambiguation filter. Branded drug names such as Advantage (Advantage brand of Imidacloprid) and Direct (Direct type of resin cement), which are ambiguous with the more common meaning of their names, are resolved to their non-pharmaceutical sense.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Transformation
</SectionTitle>
      <Paragraph position="0"> In the semantic abstraction paradigm the transformation stage condenses and generalizes, and in our approach these processes are based on four general principles: a. Relevance: Include predications on the topic of the summary. b. Connectivity: Also include &quot;useful&quot; additional predications. c. Novelty: Do not include predications that the user already knows. d. Saliency: Only include the most frequently occurring predications. Although frequency of occurrence plays a role in determining predications to be included in the summary, the other three principles depend crucially on lexical semantic information from the UMLS. These four principles guide the phases involved in creating a summary.</Paragraph>
      <Paragraph position="1"> Phase 1 (relevance) gathers predications on a given topic (in this study, disorders) and is controlled by a semantic schema (Jacquelinet et al., 2003) for that topic. The schema is represented as a set of predications in which the predicate is drawn from a relation in the UMLS Semantic Network and the arguments are represented as a &quot;domain&quot; covering a class of concepts in the Metathesaurus (Disorders, for example).</Paragraph>
      <Paragraph position="2"> Each domain for the schema is defined in terms of semantic categorization in the Semantic Network. For example, {Disorders} is a subset of the semantic group Disorders (McCray et al., 2001) and contains the following semantic types: 'Disease or Syndrome', 'Neoplastic Process', 'Mental or Behavioral Dysfunction', and 'Sign or Symptom'. Although the schema is not complete, it represents a substantial amount of what can be said about disorders. Predications produced by SemRep must conform to this schema in order to be included in the conceptual condensate; such predications are called &quot;core predications.&quot;</Paragraph>
      <Paragraph position="3"> Phase 2 (connectivity) identifies predications occurring in the neighboring semantic space of the core. This is accomplished by retrieving all the predications that share an argument with one of the core predications. For example, from Naproxen TREATS Osteoarthritis, non-core predications such as Naproxen ISA NSAID are included in the condensate.</Paragraph>
      <Paragraph position="4"> Phase 3 (novelty) works by eliminating predications that have a generic argument, as determined by hierarchical depth in the Metathesaurus. Arguments occurring less than an empirically determined distance from the root are considered too general to be useful, and predications containing them are eliminated. For example, Pharmaceutical Preparations TREATS Migraine is not included in the condensate for migraine because &quot;Pharmaceutical Preparations&quot; was determined to be generic.</Paragraph>
      <Paragraph position="5"> Phase 4 (saliency) is the final transformation phase; its operations are adapted from TOPIC's (Hahn and Reimer, 1999) saliency operators. Frequencies of occurrence for arguments, predicates, and predications are calculated, and those occurring more frequently than the average are kept in the condensate; others are eliminated. When these principles are applied to the semantic predications produced by SemRep for a full-text article with 214 sentences (Lisse et al., 2003) concerned with comparing naproxen and rofecoxib for treating osteoarthritis, with respect to effectiveness and gastrointestinal tolerability, the resulting condensate is given in [Figure: ... of a journal article on osteoarthritis].</Paragraph>
    </Section>
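The four transformation phases (relevance, connectivity, novelty, saliency) can be sketched end to end on toy data. This is a simplified illustration in Python, not the authors' system. The names condense, SCHEMA_PREDICATES, SEMANTIC_TYPE, DEPTH, and MIN_DEPTH are hypothetical; the depth values, the threshold, and the occurrence counts are invented for the example; relevance is reduced to matching the seed concept as the object of an allowed predicate; and saliency is computed over predication frequency only, whereas the text also counts arguments and predicates.

```python
from collections import Counter

DISORDER_TYPES = {"Disease or Syndrome", "Neoplastic Process",
                  "Mental or Behavioral Dysfunction", "Sign or Symptom"}
SCHEMA_PREDICATES = {"TREATS"}  # assumed one-predicate schema for disorders
SEMANTIC_TYPE = {"Osteoarthritis": "Disease or Syndrome",
                 "Migraine": "Disease or Syndrome"}
# Assumed hierarchical depths; arguments shallower than MIN_DEPTH are generic.
DEPTH = {"Naproxen": 5, "Rofecoxib": 5, "NSAID": 4, "Osteoarthritis": 5,
         "Pharmaceutical Preparations": 2, "Sumatriptan": 5, "Migraine": 5}
MIN_DEPTH = 3

# Toy input: one tuple per occurrence of a predication in the source text.
PREDICATIONS = (
    [("Naproxen", "TREATS", "Osteoarthritis")] * 3
    + [("Rofecoxib", "TREATS", "Osteoarthritis")] * 3
    + [("Naproxen", "ISA", "NSAID")] * 2
    + [("Pharmaceutical Preparations", "TREATS", "Osteoarthritis")]
    + [("Sumatriptan", "TREATS", "Migraine")]
)


def condense(predications, seed):
    counts = Counter(predications)
    unique = list(counts)  # distinct predications in first-seen order

    # Phase 1 (relevance): core predications conform to the schema.
    core = [p for p in unique
            if p[1] in SCHEMA_PREDICATES and p[2] == seed
            and SEMANTIC_TYPE.get(p[2]) in DISORDER_TYPES]

    # Phase 2 (connectivity): add predications sharing an argument with the core.
    core_args = {arg for (subj, _, obj) in core for arg in (subj, obj)}
    connected = [p for p in unique
                 if p in core or p[0] in core_args or p[2] in core_args]

    # Phase 3 (novelty): drop predications that have a generic argument.
    novel = [p for p in connected
             if DEPTH.get(p[0], 0) >= MIN_DEPTH
             and DEPTH.get(p[2], 0) >= MIN_DEPTH]

    # Phase 4 (saliency): keep predications occurring more often than average.
    if not novel:
        return []
    mean = sum(counts[p] for p in novel) / len(novel)
    return [p for p in novel if counts[p] > mean]
```

On this toy input with the seed "Osteoarthritis", the sketch keeps the Naproxen and Rofecoxib TREATS predications: the Sumatriptan predication is never connected to the core, the Pharmaceutical Preparations predication is removed as generic, and Naproxen ISA NSAID falls below the average frequency.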
  </Section>
</Paper>