<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3111">
  <Title>Integrated Annotation for Biomedical Information Extraction</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Work over the last few years in literature data mining for biology has progressed from linguistically unsophisticated models to the adaptation of Natural Language Processing (NLP) techniques that use full parsers (Park et al., 2001; Yakushiji et al., 2001) and coreference to extract relations that span multiple sentences (Pustejovsky et al., 2002; Hahn et al., 2002); for an overview, see Hirschman et al. (2002). In this work we describe an approach to two areas of biomedical information extraction, drug development and cancer genomics, that is based on developing a corpus that integrates different levels of semantic and syntactic annotation. This corpus will be a resource for training machine learning algorithms useful for information extraction, retrieval, and other data mining applications. We are currently annotating only abstracts, although in the future we plan to expand this to full-text articles. We also plan to make the corpus and associated statistical taggers publicly available.</Paragraph>
    <Paragraph position="1"> We are collaborating with researchers in the Division of Oncology at The Children's Hospital of Philadelphia, with the goal of automatically mining the corpus of cancer literature for those associations that link specified variations in individual genes with known malignancies.</Paragraph>
    <Paragraph position="2"> In particular, we are interested in extracting three entities (Gene, Variation Event, and Malignancy) in the following relationship: Gene X with genomic Variation Event Y is correlated with Malignancy Z. For example, WT1 is deleted in Wilms Tumor #5. Such statements found in the literature represent individual gene-variation-malignancy observables. A collection of such observables serves two important functions. First, it summarizes known relationships between genes, variation events, and malignancies in the cancer literature. As such, it can be used to augment information available from curated public databases, as well as serve as an independent test of the accuracy and completeness of such repositories. Second, it allows inferences to be made about gene, variation, and malignancy associations that may not be explicitly stated in the literature, at both the fact and entity instance levels. Such inferences provide testable hypotheses and thus future research targets.</Paragraph>
    <Paragraph position="3"> The other major area of focus, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline (GSK), is the extraction of information about enzymes, focusing initially on compounds that affect the activity of the cytochrome P450 (CYP) family of proteins. For example, the goal is to see a phrase like Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated activities  Previous work at GSK has used search algorithms that are based on pattern matching rules filling template slots. The rules rely on identifying the relevant passages by first identifying compound names and then associating them with a limited number of relational terms such as inhibit or inactivate. This is similar to other work in biomedical extraction projects (Hirschman et al., 2002).</Paragraph>
    <Paragraph position="4"> Creating good pattern-action rules for an IE problem is far from simple. A relation can be expressed in language in many different ways, with complexities such as syntactic alternations and the heavy use of coordination. While sufficiently complex patterns can deal with these issues, building such hand-crafted rules requires considerable time and effort, particularly since they must be developed anew for each specific problem. A corpus that is annotated with sufficient syntactic and semantic structure offers the promise of training taggers for quicker and easier information extraction.</Paragraph>
    <Paragraph position="5"> The corpus that we are developing for the two different application demands consists of three levels of annotation: the entities and relations among the entities for the oncology or CYP domain, syntactic structure (Treebank), and predicate-argument structure (Propbank). This is a novel approach from the point of view of NLP, since previous efforts at Treebanking and Propbanking have been independent of the special status of any entities, and previous efforts at entity annotation have been independent of corresponding layers of syntactic and semantic structure. The decomposition of larger entities into components of a relation, worthwhile by itself on conceptual grounds for entity definition, also allows the component entities to be mapped to the syntactic structure. These entities can be viewed as semantic types associated with syntactic constituents, and so our expectation is that automated analyses of these related levels will interact in a mutually reinforcing and beneficial way for the development of statistical taggers. Development of such statistical taggers is proceeding in parallel with the annotation effort, and these taggers help in the annotation process, as well as being steps towards automatic extraction.</Paragraph>
    <Paragraph position="6"> In this paper we focus on the aspects of this project that have been developed and are in production, while also trying to give enough of the overall vision to place the work that has been done in context. Section 2 discusses some of the main issues around the development of the guidelines for entity annotation, for both the oncology and inhibition domains. Section 3 first discusses the overall plan for the different levels of annotation, and then focuses on the integration of the two levels currently in production, entity annotation and syntactic structure.</Paragraph>
    <Paragraph position="7"> Section 4 describes the flow of the annotation process, including the development of the statistical taggers mentioned above. Section 5 is the conclusion.</Paragraph>
  </Section>
</Paper>