<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1208">
  <Title>Distributed Modules for Text Annotation and IE applied to the Biomedical Domain</Title>
  <Section position="4" start_page="50" end_page="50" type="metho">
    <SectionTitle>
2 Available Modules
</SectionTitle>
    <Paragraph position="0"> The available modules belong to four categories: (1) basic NLP modules, which mainly identify syntactic information, (2) modules which match controlled vocabularies, (3) modules which match a set of syntax patterns, and (4) modules for shallow parsing based on cascaded patterns. The categories are not independent, since named entity (NE) recognition relies on controlled vocabularies as well as on patterns for the identification of yet unknown NEs. Most modules match regular expressions (REs). These are matched with a finite-state automaton (FSA) engine we implemented, which is optimized for pipelined execution and very large REs.</Paragraph>
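The idea of compiling a whole vocabulary into one large RE and streaming sentences through it can be sketched as follows. This is our own illustration, not the authors' engine; the terms, the `<term>` tag, and the function names are hypothetical.

```python
import re

# A controlled vocabulary compiled once into a single alternation RE,
# then applied to a stream of sentences (pipelined matching).
# Longer terms are listed first so they win over shorter ones.
TERMS = ["COL1A1", "IL-1", "farnesylate"]
PATTERN = re.compile(
    "|".join(re.escape(t) for t in sorted(TERMS, key=len, reverse=True)))

def tag_stream(sentences):
    """Yield each sentence with every vocabulary match wrapped in <term>."""
    for sent in sentences:
        yield PATTERN.sub(lambda m: "<term>" + m.group(0) + "</term>", sent)
```

Compiling the alternation once and reusing it per sentence is what makes steady-state matching cheap, which matters for the pipelined execution described above.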
    <Paragraph position="1"> Basic NLP modules comprise the sentenciser and a part-of-speech (PoS) tagger. The sentenciser splits text into sentences and wraps each of them into a SENT XML element with a unique ID. The PoS tagger was trained on the British National Corpus, but contains lexicon extensions for biomedical concepts. Noun phrases (NPs) are identified with syntax patterns equivalent to</Paragraph>
  </Section>
  <Section position="5" start_page="50" end_page="51" type="metho">
    <SectionTitle>
DET (ADJ|ADV) N+.
</SectionTitle>
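The pattern above can be read as a regular expression over PoS tags. The following minimal sketch is our own illustration: the tagset, the Kleene iteration over (ADJ|ADV), and the function name are assumptions, not the authors' implementation.

```python
import re

# NP chunking as an RE over a space-joined string of PoS tags,
# approximating the pattern "DET (ADJ|ADV) N+" with optional repetition
# of the (ADJ|ADV) slot (our assumption).
NP = re.compile(r"DET (?:(?:ADJ|ADV) )*N(?: N)*")

def find_nps(tagged):
    """tagged: list of (token, tag) pairs. Returns token lists for each NP."""
    tags = " ".join(tag for _, tag in tagged)
    result = []
    for m in NP.finditer(tags):
        first = tags[:m.start()].count(" ")   # index of first matched tag
        n = m.group(0).count(" ") + 1         # number of tags in the match
        result.append([tok for tok, _ in tagged[first:first + n]])
    return result
```

Matching over the tag string rather than the tokens keeps the pattern itself readable and close to the notation used in the paper.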
    <Paragraph position="0"> Controlled vocabularies Identification and tagging of terminology is a variant of NE recognition. In biology and medicine, a large number of concepts are stored in databases like UniProt, where roughly 190,000 database entries link PGNs to protein function, species and tissue type. PGNs from UniProt are transformed into REs which account for morphological variability. For example, col1a1 is transformed into the pattern (COL1A1|[cC]ol1a1) and IL-1 into (IL|[Ii]l)[- ]1. The PGNs available from the database thus automatically generate a link from the text to one or more database entries. While adding more dictionaries is technically trivial, it creates the problem of conflicting definitions. UniProt alone already introduces the uppercase concept names CAT, NOT and FOR as PGNs. Disambiguation of such definitions will be added as soon as it is available.</Paragraph>
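A transform of this kind might be sketched as below. This is a simplification under two assumed rules only, an all-uppercase variant and a variant with either-case initial letter, with '-' alternating with a space; the real system's rules are richer.

```python
import re

def pgn_pattern(name):
    """Build an RE for a PGN allowing simple morphological variants:
    all-uppercase, or either-case first letter with lowercase remainder;
    '-' may be written as '-' or as a space. Illustrative only."""
    def variant(s):
        # escape per character, mapping '-' to the class [- ]
        return "".join("[- ]" if c == "-" else re.escape(c) for c in s)
    upper = variant(name.upper())
    low = name.lower()
    mixed = "[%s%s]%s" % (low[0].upper(), low[0], variant(low[1:]))
    return "(%s|%s)" % (upper, mixed)
```

For example, `pgn_pattern("col1a1")` yields `(COL1A1|[Cc]ol1a1)`, matching the transformed pattern quoted above up to the order of letters in the character class.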
    <Paragraph position="1"> Syntax patterns A number of IE tasks are solved with syntax patterns. The following modules are integrated into the server: (1) identification of abbreviations, (2) identification of PGN definitions, and (3) identification of mutations.</Paragraph>
    <Paragraph position="2"> Abbreviation extraction is described for example in (Chang et al., 2002). In our approach, a variety of patterns equivalent to NP '(' token ')' is used, where the token has to be the abbreviation of the NP. If, however, an abbreviation is found in the text without its expanded form, it is necessary to decide whether it is indeed an abbreviation and which expansion applies (work in progress).</Paragraph>
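A toy version of the NP '(' token ')' pattern could look like this. It is our own sketch: the acceptance test (initial letters of the trailing words must spell the short form) is one simple heuristic, whereas the system described above uses a wider variety of patterns.

```python
import re

# Words followed by a parenthesised candidate abbreviation.
PAREN = re.compile(r"((?:\w+[- ])+)\((\w+)\)")

def find_abbrevs(sentence):
    """Return (expansion, abbreviation) pairs where the initials of the
    last len(abbrev) words spell the abbreviation, case-insensitively."""
    out = []
    for m in PAREN.finditer(sentence):
        words = re.split(r"[- ]+", m.group(1).strip())
        short = m.group(2)
        tail = words[-len(short):]
        if len(tail) == len(short) and \
           "".join(w[0] for w in tail).lower() == short.lower():
            out.append((" ".join(tail), short))
    return out
```

Trimming the expansion to the last len(abbrev) words avoids swallowing a leading determiner into the expansion.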
    <Paragraph position="3"> A separate module identifies sentence pieces where the author explicitly states that a concept denotes a PGN. Examples are The AZ2 protein was ... and PMP22 is the crucial gene .... Such examples were translated into the following four patterns: (1) the X protein, (2) the protein X, (3) T domain of NP, and (4) NP is a protein. Here X denotes a single token and T represents a selection of concepts which are known to be used in conjunction with a protein. The tokens the, is, a and protein again represent sets of equivalent tokens.</Paragraph>
    <Paragraph position="4"> Identification of mutations is integrated as described in (Rebholz-Schuhmann et al., 2004). Integrated patterns identify nomenclature equivalent to AA [0-9]+ AA, where AA denotes all variants of an amino acid or nucleic acid. Apart from the infix representation of the mutation, any postfix and prefix representation is covered, as well as other syntactic variation. NLP-based IE One component identifies and highlights protein-protein interactions. It is essential that a phrase describing an interaction contains a verb or a nominal form describing an interaction, like bind or dimerization.</Paragraph>
    <Paragraph position="6"> In total, 21 verbs are considered, including 10 verbs which are specific to molecular biology, like farnesylate. A protein-protein interaction is identified and tagged if such a verb phrase connects two noun phrases and if at least one of the NPs contains a PGN according to the terminology tagging.</Paragraph>
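The infix mutation pattern AA [0-9]+ AA can be sketched as an RE. This is a minimal illustration: AA here covers only a handful of three-letter and one-letter amino-acid codes, whereas the system described above covers all amino- and nucleic-acid variants plus prefix and postfix notations.

```python
import re

# A small stand-in for the AA alternation (full system covers all codes).
AA = r"(?:Ala|Arg|Asp|Gly|Ser|Thr|Val|[ARDGSTV])"
MUTATION = re.compile(r"\b(%s)([0-9]+)(%s)\b" % (AA, AA))

def find_mutations(text):
    """Return all substrings matching the infix mutation nomenclature."""
    return [m.group(0) for m in MUTATION.finditer(text)]
```

Listing the three-letter codes before the one-letter class in the alternation ensures that Asp178Val is matched as a whole rather than starting at the single letter A.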
  </Section>
  <Section position="6" start_page="51" end_page="52" type="metho">
    <SectionTitle>
3 Pipeline of modules shared in distributed computing
</SectionTitle>
    <Paragraph position="0"> Obviously, the presented modules do not work independently of each other. For example, the protein-protein interaction module uses NP detection (a basic NLP module), which itself relies on PoS tagging. In addition, NP detection integrates marked concepts from the terminology tagging module for the identification of protein-protein interactions. The modules form a pipeline equivalent to a UNIX pipe like &amp;quot;cat input.txt  |inputFilter |</Paragraph>
    <Paragraph position="2"> Dependencies between the modules have to be kept in mind to determine their correct order. While the text passes through the pipeline, every filter picks the XML element it is responsible for and copies everything else unchanged to the output.</Paragraph>
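This filter behaviour can be sketched as follows. The sketch is ours, not the authors' code: it uses an RE where a production filter might use a streaming parser, and the SENT element and the insulin-tagging transform are merely illustrative.

```python
import re

# Match a SENT element, capturing its open tag, content, and close tag.
SENT = re.compile(r"(<SENT[^>]*>)(.*?)(</SENT>)", re.S)

def make_filter(transform):
    """Build a stream filter that applies `transform` to the content of
    each SENT element and copies all other input through unchanged."""
    def filt(lines):
        for line in lines:
            yield SENT.sub(
                lambda m: m.group(1) + transform(m.group(2)) + m.group(3),
                line)
    return filt

# Example filter: tag one (hypothetical) vocabulary term inside sentences.
tag_insulin = make_filter(
    lambda s: s.replace("insulin", "<term>insulin</term>"))
```

Because each filter touches only its own element, filters can be chained in any dependency-respecting order without knowing about each other's markup.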
    <Paragraph position="3"> The input filter wraps arbitrary natural language text into an XML element describing the source of the document. Any further module analyses the text and adds meta data (XML tags). The following synthesis phase combines the available facts into larger structures, e.g. mutations of a gene or protein-protein interactions.</Paragraph>
    <Paragraph position="4"> Running the pipeline of modules on a single compute node leads to insufficient response times, since the modules tend to have large memory footprints. In particular, the PoS tagger as well as the terminology taggers load large dictionaries into memory and therefore have considerable startup time, whereas steady-state operation is fast. One solution to this problem is to implement each module as a dedicated server process which is kept in memory for immediate response.</Paragraph>
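The resident-server idea can be sketched like this: the expensive resource is loaded once at process startup and every request is then answered from memory. The dictionary content, protocol, and function names are hypothetical; the paper's modules are Java servers, sketched here in Python.

```python
import socketserver

# Stand-in for a large dictionary: loaded once, stays resident in memory,
# so per-request work is a cheap lookup rather than a costly startup.
DICTIONARY = {"il-1": "interleukin 1"}

class LookupHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # one line in, one line out: term -> expansion (or "?")
        term = self.rfile.readline().decode().strip().lower()
        self.wfile.write((DICTIONARY.get(term, "?") + "\n").encode())

def serve(port=0):
    """Create the resident server; the caller runs srv.serve_forever().
    Port 0 picks a free ephemeral port."""
    return socketserver.TCPServer(("localhost", port), LookupHandler)
```

The startup cost is paid once per process lifetime instead of once per document, which is exactly the trade-off motivating the dedicated server processes described above.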
    <Paragraph position="5"> REs are applied for processing of data and meta data. This leads to a special constraint in the handling of XML tags. It is well known that REs cannot match recursive parenthesized structures. As a result, XML elements used as meta data are not allowed to contain themselves. If the XML elements denote parts of a phrase structure of a natural language sentence, this may in principle be a restriction, but in practical applications it is not.</Paragraph>
    <Paragraph position="6"> [Figure 1 caption:] ... into communication components (comm). The controlling server sends a request to the last component in the pipe. Each component contacts its predecessor for input and routes it through the module. The first component finally contacts back to the controlling server to fetch the input and send it down the pipe.</Paragraph>
    <Paragraph position="7"> We implemented a set of Java classes which allows setting up distributed pipelined processing. It handles the details of client/server communication needed to run IE modules in a pipeline and allows the developer (researcher) to modify and replace modules. As a result, any class with a method that reads from an input stream and writes results to an output stream can serve as a module. In Java terms, the applied interface is a java.lang.Runnable calling its methods in void run(). A general purpose server class is available which, given a factory method to create the Runnable, handles all the details of setting up and shutting down the connections. In particular, the connections to establish a pipeline M1, ..., Mn are created as follows (fig. 1): The controlling server C generates the pipeline of modules M1, ..., Mn.</Paragraph>
    <Paragraph position="8"> Typically, a component in the web server creates a reversed list of the modules and adds itself to the end of the list: Mn, ..., M1, C. Then it removes Mn from the list, contacts Mn, sends it the shortened list Mn-1, ..., M1, C and starts reading input from Mn. Module Mn follows the same procedure as the server and starts the Runnable which performs its function, receiving input from the upstream server and writing output to the downstream server. All modules act the same way and finally M1 contacts the controlling server C to obtain the input data. Obviously, C needs to write data to M1 and read data from Mn in parallel.</Paragraph>
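The recursive wiring described above can be modelled in-process without sockets. In this sketch (ours, not the authors' Java classes) each module is simply a function from an input iterator to an output iterator, and the reversed list is unwound recursively until the controller's source is reached.

```python
# Toy model of the wiring protocol: the controller hands the last module
# the reversed list of its predecessors; each module connects its
# upstream before running itself, so data flows C -> M1 -> ... -> Mn.

def connect(chain, source):
    """chain = [Mn, ..., M1] (reversed order, as in the protocol).
    Returns the output iterator of Mn."""
    if not chain:
        return source              # M1 reaches back to the controller C
    last, rest = chain[0], chain[1:]
    return last(connect(rest, source))

def upper(lines):                  # stand-in module M1
    return (l.upper() for l in lines)

def number(lines):                 # stand-in module M2
    return ("%d:%s" % (i, l) for i, l in enumerate(lines))

out = connect([number, upper], iter(["a", "b"]))   # pipeline M1 -> M2
```

Consuming `out` pulls data through the whole chain lazily, mirroring how each socket component reads from its predecessor on demand.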
  </Section>
</Paper>