<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3017">
  <Title>Supporting Annotation Layers for Natural Language Processing</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Layered Query Language
</SectionTitle>
    <Paragraph position="0"> Our framework differs from others by simultaneously supporting several key features: + Multiple overlapping layers (which cannot be expressed in a single XML file), including self-overlapping (e.g., a word shared by two phrases from the same layer), and parallel layers, as when multiple syntactic parses span the same text.</Paragraph>
    <Paragraph position="1"> + Integration of multiple intersecting hierarchies (e.g., MeSH, UMLS, WordNet).</Paragraph>
    <Paragraph position="2"> + Flexible results format.</Paragraph>
    <Paragraph position="3"> + Tight integration with SQL, including application of SQL operators over the returned results.
    + Scalability to large collections such as MEDLINE (containing millions of documents).4
    While existing systems possess some of these features, none offers all of them.</Paragraph>
    <Paragraph position="4"> We assume that the underlying text is fairly static. While we support addition, removal and editing of annotations via a Java API, we do not optimize for efficient editing, but instead focus on compact representation, easy query formulation, easy addition and removal of layers, and straightforward translation into SQL. Below we illustrate our Layered Query Language (LQL) with examples; a formal description of the language and additional examples are given elsewhere.</Paragraph>
    <Paragraph position="5"> Figure 1 illustrates the layered annotation of a sentence from biomedical text. Each annotation represents an interval spanning a sequence of characters, using absolute beginning and ending positions. Each layer corresponds to a conceptually different kind of annotation (e.g., word, gene/protein6, shallow parse). Layers can be sequential, overlapping (e.g., two concepts sharing the same word), and hierarchical (either spanning, when the intervals are nested as in a parse tree, or ontological, when the token itself is derived from a hierarchical ontology). Word, POS and shallow parse layers are sequential (the latter can skip or span multiple words). The gene/protein layer assigns IDs from the LocusLink database of gene names.7 For a given gene there are as many LocusLink IDs as the number of organisms it is found in (e.g., 4 in the case of the gene Bcl-2). The MeSH layer contains entities from the hierarchical medical ontology MeSH (Medical Subject Headings).8 The MeSH annotations in Figure 1 are overlapping (they share the word cell) and hierarchical in both ways: spanning, since blood cell (with MeSH id D001773) orthographically spans the word cell (id A11), and ontological, since blood cell is a kind of cell and cell death (id D016923) is a kind of Biological Phenomena.</Paragraph>
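    <Paragraph> The interval view of annotations described above can be sketched in a few lines of Python. This is only an illustration, not the system's Java API: the class and function names are hypothetical, the toy text "blood cell death" and its character offsets are made up for the example, and the end position is assumed to be exclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str  # kind of annotation, e.g. "word" or "mesh"
    start: int  # absolute character offset where the span begins
    end: int    # absolute offset just past the span (assumed exclusive)
    tag: str    # layer-specific identifier, e.g. a MeSH id


def overlaps(a, b):
    # Two intervals overlap when their intersection is a non-empty range.
    return bool(range(max(a.start, b.start), min(a.end, b.end)))


def spans(outer, inner):
    # outer spans inner when inner lies entirely within outer.
    return (min(outer.start, inner.start) == outer.start
            and max(outer.end, inner.end) == outer.end)


text = "blood cell death"  # toy text, not the sentence from Figure 1
blood_cell = Annotation("mesh", 0, 10, "D001773")
cell = Annotation("mesh", 6, 10, "A11")
cell_death = Annotation("mesh", 6, 16, "D016923")

print(text[blood_cell.start:blood_cell.end])  # covered text of one annotation
print(overlaps(blood_cell, cell_death))       # they share the word "cell"
print(spans(blood_cell, cell))                # nested, as in a parse tree
print(spans(blood_cell, cell_death))          # overlapping but not nested
```

    Storing only offsets keeps each layer independent of the others, which is what lets overlapping and nested annotations coexist without a single enclosing tree.</Paragraph>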
    <Paragraph position="6"> Given this annotation, we can extract potential protein-protein interactions from MEDLINE text.</Paragraph>
    <Paragraph position="7"> One simple approach is to follow Blaschke et al. (1999), who developed a list of verbs (and their derived forms) and scanned for sentences containing the pattern PROTEIN ... INTERACTION-VERB ... PROTEIN. This can be expressed in LQL as follows:</Paragraph>
    <Paragraph position="10"> This example extracts sentences containing a protein name in the gene/protein layer, followed by any sequence of words (because of ALLOW GAPS), followed by the interaction verb activates, followed by any sequence of words, and finally followed by another protein name. All possible protein matches within the same sentence will be returned. The results are presented as pairs of protein names.</Paragraph>
    <Paragraph position="11"> Each query level specifies a layer (e.g., sentence, part-of-speech, gene/protein) and optional restrictions on the attribute values. A binding statement is allowed after the layer's closing bracket. We can search for more than one verb simultaneously, e.g., by changing the POS layer of the query above</Paragraph>
    <Paragraph position="13"> Further, a wildcard such as content 'activate%' can match the verb forms activate, activates and activated. We can also use double quotes (&quot;) to make the comparison case-insensitive. Finally, since LQL is automatically translated into SQL, SQL code can be written to surround the LQL query and to reference its results, thus allowing the use of SQL operators such as GROUP BY, COUNT, DISTINCT, ORDER BY, etc., as well as set operations like UNION.</Paragraph>
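    <Paragraph> To make the role of the surrounding SQL concrete, here is a minimal sketch using Python's built-in sqlite3 module. It does not show LQL or its actual translation: the table lql_results, its columns, and the placeholder protein names P1 to P3 are all hypothetical stand-ins for the rows an LQL query might return. The point is only that a prefix wildcard and operators such as GROUP BY, COUNT and ORDER BY apply directly once results are relational.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical result table standing in for the output of an LQL query.
conn.execute(
    "CREATE TABLE lql_results ("
    "sentence_id INTEGER, protein1 TEXT, verb TEXT, protein2 TEXT)"
)
conn.executemany(
    "INSERT INTO lql_results VALUES (?, ?, ?, ?)",
    [
        (1, "P1", "activates", "P2"),
        (2, "P1", "activated", "P2"),
        (3, "P3", "activate", "P1"),
        (4, "P2", "inhibits", "P3"),
    ],
)

# The prefix wildcard keeps all forms of the verb; GROUP BY and COUNT
# then tally how often each ordered protein pair was matched.
rows = conn.execute(
    "SELECT protein1, protein2, COUNT(*) AS n "
    "FROM lql_results "
    "WHERE verb LIKE 'activate%' "
    "GROUP BY protein1, protein2 "
    "ORDER BY n DESC, protein1"
).fetchall()

for protein1, protein2, n in rows:
    print(protein1, protein2, n)
```

    Counting repeated pairs in this way is one simple means of ranking candidate interactions before further processing.</Paragraph>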
    <Paragraph position="14"> Now consider the task of extracting interactions between chemicals and diseases. Given the sentence "Adherence to statin prevents one coronary heart disease event for every 429 patients.", we want to extract the relation that statin (potentially) prevents coronary heart disease. The latter is in the MeSH hierarchy (id D003327) with tree codes C14.280.647.250 and C14.907.553.470.250, while the former is listed in the MeSH supplementary concepts (id C047068). In fact, the whole C subtree in MeSH contains diseases and all supplementary MeSH concepts represent chemicals. So we can find potentially useful sentences (to be further processed by another algorithm) using the following query:  This looks for sentences containing two NPs in any order without overlaps (NO ORDER) and separated by any number of intervening elements. We further require one of the NPs to end (ensured by the $ symbol) with a chemical, and the other (the disease) to end with a MeSH term from the C subtree.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 System Architecture
</SectionTitle>
    <Paragraph position="0"> Our basic model is similar to that of TIPSTER (Grishman, 1996): each annotation is stored as a record, which specifies the character-level beginning and ending positions, the layer and the type. The basic table9 contains the following columns: (1) annotation id; (2) doc id; (3) section: title, abstract or body; (4) layer id: layer identifier (word, POS, shallow parse, sentence, etc.); (5) start char pos: beginning character position, relative to section and doc id; (6) end char pos: ending character position; (7) tag type: a layer-specific token identifier. After evaluating various extensions of the structure above, we arrived at one with some additional columns, which improves cross-layer query performance: (8) sentence id; (9) word id; (10) first word pos; and (11) last word pos. Columns (9)-(11) treat the word layer as atomic and require all annotations to coincide with word boundaries.</Paragraph>
    <Paragraph position="1"> Finally, we use two types of composite indexes: forward, which looks up positions in a given document, and inverted, which supports searching based on annotation values.10 An index lookup can be performed on any column combination that corresponds to an index prefix. The RDBMS query optimizer estimates the optimal access paths (index and table scans) and join orders based on statistics collected over the stored records. In complex queries, a combination of forward (F) and inverted (I) indexes is typically used. The particular ones we used are:11
    (F) +doc id+section+layer id+sentence+first word pos+last word pos+tag type
    (I) +layer id+tag type+doc id+section+sentence+first word pos+last word pos
    (I) +word id+layer id+tag type+doc id+section+sentence+first word pos
    We have experimented with the system on a collection of 1.4 million MEDLINE abstracts, which include 10 million sentences annotated with 320 million multi-layered annotations. The current database size is around 70 GB. Annotations are indexed as they are inserted into the database.</Paragraph>
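    <Paragraph> The table and the three composite indexes listed above can be sketched with Python's built-in sqlite3 module. This is an approximation under stated assumptions, not the system's actual DDL: the underscored column names, the column types, the index names, and the choice of SQLite as the RDBMS are all assumptions, and the index entry written as sentence is mapped to the sentence id column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns (1)-(11) from the text; names and types are assumed.
conn.execute(
    "CREATE TABLE annotation ("
    "annotation_id INTEGER PRIMARY KEY, doc_id INTEGER, section TEXT, "
    "layer_id INTEGER, start_char_pos INTEGER, end_char_pos INTEGER, "
    "tag_type INTEGER, sentence_id INTEGER, word_id INTEGER, "
    "first_word_pos INTEGER, last_word_pos INTEGER)"
)

# Forward index: locate annotations by position within a document.
conn.execute(
    "CREATE INDEX idx_forward ON annotation (doc_id, section, layer_id, "
    "sentence_id, first_word_pos, last_word_pos, tag_type)"
)
# Inverted indexes: locate annotations by their value.
conn.execute(
    "CREATE INDEX idx_inverted ON annotation (layer_id, tag_type, doc_id, "
    "section, sentence_id, first_word_pos, last_word_pos)"
)
conn.execute(
    "CREATE INDEX idx_word ON annotation (word_id, layer_id, tag_type, "
    "doc_id, section, sentence_id, first_word_pos)"
)


def plan(sql):
    # Return the optimizer's chosen access path as one string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)


# Constrains a prefix of the forward index (doc, section, layer).
forward_plan = plan(
    "SELECT tag_type FROM annotation "
    "WHERE doc_id = 1 AND section = 'abstract' AND layer_id = 2"
)
# Constrains a prefix of the first inverted index (layer, value).
inverted_plan = plan(
    "SELECT doc_id FROM annotation WHERE layer_id = 2 AND tag_type = 7"
)
print(forward_plan)
print(inverted_plan)
```

    The exact wording of the plan output depends on the SQLite version, but each query should be served by the index whose leading columns it constrains, which is the prefix property described above.</Paragraph>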
    <Paragraph position="2"> 9 There are some additional tables mapping token IDs to entities (the string in case of a word, the MeSH label(s) in case of a MeSH term, etc.). 10 These inverted indexes can be seen as a direct extension of the widely used inverted file indexes in traditional IR systems. 11 There is also an index on annotation id, which allows for annotating relations between annotations.</Paragraph>
    <Paragraph position="3"> Our initial evaluation shows variation in the execution time, depending on the kind and complexity of the query. Response time for simple queries is usually less than a minute, while for more complex ones it can be much longer. We are in the process of further investigating and tuning the system.</Paragraph>
  </Section>
</Paper>