<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1026">
  <Title>Extracting Molecular Binding Relationships from Biomedical Text</Title>
  <Section position="2" start_page="188" end_page="191" type="metho">
    <SectionTitle>
1 Extracting Binding Relationships from Text
</SectionTitle>
    <Paragraph position="0"> Our strategy for extracting binding relationships from text divides the task into two phases. In the first phase we identify all potential binding arguments; in the second we extract just those binding terms that the text asserts as participating in a particular binding predication. In support of this processing, we rely on the linguistic and domain knowledge contained in the National Library of Medicine's Unified Medical Language System® (UMLS®) as well as an existing tool, the SPECIALIST minimal commitment parser (Aronson et al. 1994).</Paragraph>
    <Section position="1" start_page="188" end_page="188" type="sub_section">
      <Paragraph position="1"> The UMLS (Humphreys et al. 1998) consists of several knowledge sources applicable in the biomedical domain: the Metathesaurus, Semantic Network, and SPECIALIST Lexicon (McCray et al. 1994). The Metathesaurus was constructed from more than forty controlled vocabularies and contains more than 620,000 biomedical concepts. The characteristic of the Metathesaurus most relevant for this project is that each concept is associated with a semantic type that categorizes the concept into subareas of biology or medicine. Examples pertinent to binding terminology include the semantic types 'Amino Acid, Peptide, or Protein' and 'Nucleotide Sequence'. The SPECIALIST Lexicon (with associated lexical access tools) supplies syntactic information for a large compilation of biomedical and general English terms.</Paragraph>
      <Paragraph position="2"> The SPECIALIST minimal commitment parser relies on the SPECIALIST Lexicon as well as the Xerox stochastic tagger (Cutting et al. 1992). The output produced is in the tradition of partial parsing (Hindle 1983, McDonald 1992, Weischedel et al. 1993) and concentrates on the simple noun phrase, what Weischedel et al. (1993) call the "core noun phrase," that is, a noun phrase with no modification to the right of the head. Several approaches provide similar output based on statistics (Church 1988, Zhai 1997, for example), a finite-state machine (Ait-Mokhtar and Chanod 1997), or a hybrid approach combining statistics and linguistic rules (Voutilainen and Padro 1997).</Paragraph>
      <Paragraph position="3"> The SPECIALIST parser is based on the notion of barrier words (Tersmette et al. 1988), which indicate boundaries between phrases. After lexical look-up and resolution of category label ambiguity by the Xerox tagger, complementizers, conjunctions, modals, prepositions, and verbs are marked as boundaries. Subsequently, boundaries are considered to open a new phrase (and close the preceding phrase). Any phrase containing a noun is considered to be a (simple) noun phrase, and in such a phrase, the right-most noun is labeled as the head, and all other items (other than determiners) are labeled as modifiers. An example of the output from the SPECIALIST parser is given below in (4). The partial parse produced serves as the basis for the first phase of extraction of binding relationships, namely the identification of those simple noun phrases acting as potential binding arguments (referred to as "binding terms").</Paragraph>
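The barrier-word segmentation described above can be illustrated with a minimal sketch. This is not the SPECIALIST parser itself: the tag names, the function names chunk and label, and the choice to keep a barrier word's own tag as its label are all assumptions made for illustration. Input is assumed to be already tagged.

```python
# Illustrative sketch only. Tag names are assumptions, not the SPECIALIST tag set.
BARRIER_TAGS = {"compl", "conj", "modal", "prep", "verb"}


def chunk(tagged):
    """Split (token, tag) pairs into phrases; each barrier word opens a new phrase."""
    phrases, current = [], []
    for token, tag in tagged:
        if tag in BARRIER_TAGS and current:
            phrases.append(current)  # the barrier closes the preceding phrase
            current = []
        current.append((token, tag))
    if current:
        phrases.append(current)
    return phrases


def label(phrase):
    """Label a phrase that contains a noun; return None when it has no noun."""
    noun_positions = [i for i, (_, tag) in enumerate(phrase) if tag == "noun"]
    if not noun_positions:
        return None  # not a simple noun phrase
    head_index = noun_positions[-1]  # the right-most noun is the head
    labeled = []
    for i, (token, tag) in enumerate(phrase):
        if i == head_index:
            labeled.append(("head", token))
        elif tag == "det":
            labeled.append(("det", token))
        elif tag in BARRIER_TAGS:
            labeled.append((tag, token))  # assumption: barrier words keep their tag
        else:
            labeled.append(("mod", token))
    return labeled


if __name__ == "__main__":
    tagged = [("to", "prep"), ("the", "det"), ("small", "adj"),
              ("bacterial", "adj"), ("protein", "noun")]
    for phrase in chunk(tagged):
        print(label(phrase))
```

Because the barrier word becomes the first member of the phrase it opens, a prepositional phrase comes out as a simple noun phrase introduced by its preposition.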
    </Section>
    <Section position="2" start_page="188" end_page="189" type="sub_section">
      <SectionTitle>
1.1 Identifying binding terminology
</SectionTitle>
      <Paragraph position="0"> To identify binding terminology in text, we rely on the approach discussed in Rindflesch et al. (1999). Text with locally-defined acronyms expanded is submitted to the Xerox tagger and the SPECIALIST parser. Subsequent processing concentrates on the heads of simple noun phrases and proceeds in a series of cascaded steps that depend on existing domain knowledge as well as several small, special-purpose resources in order to determine whether each noun phrase encountered is to be considered a binding term.</Paragraph>
      <Paragraph position="1"> As the first step in the process, an existing program, MetaMap (Aronson et al. 1994), attempts to map each simple noun phrase to a concept in the UMLS Metathesaurus. The semantic type for concepts corresponding to successfully mapped noun phrases is then checked against a small subset of UMLS semantic types referring to bindable entities, such as 'Amino Acid, Peptide, or Protein', 'Nucleotide Sequence', 'Carbohydrate', 'Cell', and 'Virus'. For concepts with a semantic type in this set, the corresponding noun phrase is considered to be a binding term.</Paragraph>
      <Paragraph position="2"> The heads of noun phrases that do not map to a concept in the Metathesaurus are tested against a small set of general "binding words," which often indicate that the noun phrase in which they appear is a binding term. The set of binding words includes such nouns as cleft, groove, membrane, ligand, motif, receptor, domain, element, and molecule.</Paragraph>
      <Paragraph position="3"> The head of a noun phrase not resolved by the preceding steps is examined to see whether it adheres to the morphologic shape of a normal English word. A head that does not adhere is often an acronym not defined locally and indicates the presence of a binding term (Fukuda et al. 1998). A normal English word has at least one vowel and no digits; a text token that contains at least one letter and is not a normal English word functions as a binding word in this context.</Paragraph>
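The cascade of checks described in the preceding paragraphs can be sketched as follows. This is a simplified illustration under stated assumptions, not ARBITER's code: the function names and the returned evidence labels other than "Morphology Shape Rule" are hypothetical, both sets hold only the examples the text lists (the full sets are not given), vowels are taken to be a, e, i, o, u, and MetaMap is replaced by a caller-supplied list of semantic types (None meaning that no concept was mapped).

```python
# Illustrative sketch only. Both sets hold just the examples named in the text.
BINDABLE_TYPES = {
    "Amino Acid, Peptide, or Protein", "Nucleotide Sequence",
    "Carbohydrate", "Cell", "Virus",
}
BINDING_WORDS = {
    "cleft", "groove", "membrane", "ligand", "motif",
    "receptor", "domain", "element", "molecule",
}


def is_normal_english_word(token):
    """At least one vowel and no digits (vowel inventory is an assumption)."""
    has_vowel = any(ch in "aeiou" for ch in token.lower())
    has_digit = any(ch.isdigit() for ch in token)
    return has_vowel and not has_digit


def fires_shape_rule(token):
    """A token with at least one letter that is not a normal English word."""
    has_letter = any(ch.isalpha() for ch in token)
    return has_letter and not is_normal_english_word(token)


def classify(head, semantic_types=None):
    """Return the evidence for treating the phrase as a binding term, or None.

    semantic_types stands in for a MetaMap lookup: None means no concept
    was mapped; otherwise it lists the mapped concept's semantic types.
    """
    if semantic_types is not None:
        # Step 1: mapped phrases are judged on semantic type alone.
        if BINDABLE_TYPES.intersection(semantic_types):
            return "UMLS Semantic Type"  # hypothetical label
        return None
    if head.lower() in BINDING_WORDS:
        return "Binding Word"  # hypothetical label; no plural handling here
    if fires_shape_rule(head):
        return "Morphology Shape Rule"
    return None


if __name__ == "__main__":
    for head in ["Jel42", "HPr", "receptor", "structure"]:
        print(head, classify(head))
```

Under these assumptions, a head such as Jel42 is flagged because it contains digits, and HPr because it contains no vowel, while an ordinary word such as structure passes through unflagged.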
      <Paragraph position="4"> The final step in identifying binding terms is to join contiguous simple noun phrases qualifying as binding terms into a single macro-noun phrase. Rindflesch et al. (1999) use the term "macro-noun phrase" to refer to structures that include reduced relative clauses (commonly introduced by prepositions or participles) as well as appositives. Two binding terms joined by a form of be are also treated as though they formed a macro-noun phrase, as in Jel42 is an IgG which binds ...</Paragraph>
      <Paragraph position="5"> The results of identifying binding terms (and thus potential binding arguments) are given in (4) for the sentence in (3). In (4) evidence supporting identification as a binding term is given in braces. Note that in the underspecified syntactic analysis, prepositional phrases are treated as (simple) noun phrases that have a preposition as their first member.</Paragraph>
      <Paragraph position="6">  (3) Jel42 is an IgG which binds to the small bacterial protein, HPr and the structure of the complex is known at high resolution.</Paragraph>
      <Paragraph position="7"> (4) [binding_term([ head(Jel42)], { Morphology Shape Rule }</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="3" start_page="189" end_page="191" type="sub_section">
      <SectionTitle>
1.2 Identifying binding terms as arguments of relationships
</SectionTitle>
      <Paragraph position="0"> Before addressing the strategy for determining the arguments of binding predications, we discuss the general treatment of macro-noun phrases during the second part of the processing. Although ARBITER attempts to recover complete macro-noun phrases during the first phase, only the most specific (and biologically useful) part of a macro-noun phrase is recovered during the extraction of binding predications. Terms referring to specific molecules are more useful than those referring to general classes of bindable entities, such as receptor, ligand, protein, or molecule. The syntactic head of a macro-noun phrase (the first simple noun phrase in the list) is not always the most specific or most useful term in the construction.</Paragraph>
      <Paragraph position="1"> The Specificity Rule for determining the most specific part of the list of simple binding terms constituting a macro-noun phrase chooses the first simple term in the list which has either of the following two characteristics: a) The head was identified by the Morphology Shape Rule.</Paragraph>
      <Paragraph position="2"> b) The noun phrase maps to a UMLS concept having one of the following semantic types: 'Amino Acid, Peptide, or Protein', 'Nucleic Acid, Nucleoside, or Nucleotide', 'Nucleotide Sequence', 'Immunologic Factor', or 'Gene or Genome'. For example, in (5), the second simple term, TNF-alpha promoter, maps to the Metathesaurus with semantic type 'Nucleotide Sequence' and is thus considered to be the most specific term in this complex-noun phrase.</Paragraph>
      <Paragraph position="3"> (5) binding_term( [transcriptionally active kappaB motifs], [in the TNF-alpha promoter], [in normal cells]) In identifying binding terms as arguments of a complete binding predication, as indicated above, we examine only those binding relations cued by some form of the verb bind (bind, binds, bound, and binding). The list of minimal syntactic phrases constituting the partial parse of the input sentence is examined from left to right; for each occurrence of a form of bind, the two binding terms serving as arguments are then sought. (During the tagging process, we force bind, binds, and bound to be labeled as "verb," and binding as "noun.") A partial analysis of negation and coordination is undertaken by ARBITER, but anaphora resolution and a syntactic treatment of relativization are not attempted. With the added constraint that a binding argument must have been identified as a binding term based on the domain knowledge resources used, the partial syntactic analysis available to ARBITER supports the accurate identification of a large number of binding predications asserted in the research literature.</Paragraph>
      <Paragraph position="4">  It is convenient to categorize binding predications into two classes depending on which form of bind cues the predication: a) binding and b) bind, binds, and bound. In our test collection (discussed below), about half of the binding relationships asserted in the text are cued by the gerundive or participial form binding. In this syntactic predication, the resources available from the underspecified syntactic parse serve quite well as the basis for correctly identifying the arguments of the binding relationship.</Paragraph>
      <Paragraph position="5"> The most common argument configuration associated with binding is for both arguments to occur to the right, cued by prepositions, most commonly of and to; however, other frequent patterns are of-by and to-by. Another method of argument cuing for binding is for the subject of the predication to function syntactically as a modifier of the head binding in the same simple noun phrase. The object in this instance is then cued by either of or to (to the right). A few other patterns are seen and some occurrences of binding do not cue a complete predication; either the subject is missing or neither argument is explicitly mentioned. However, the examples in (6) fairly represent the interpretation of binding.</Paragraph>
      <Paragraph position="6">  The arguments of forms of bind other than binding invariably occur on either side of the cuing verb form. The default strategy for identifying both arguments in these instances is to choose the closest binding term on either side of the verb. In the cases we have investigated, this strategy works often enough to be useful for the surface object. However, due to predicate coordination as well as relativization, such a strategy often fails to identify correctly the surface subject of bind (binds or bound) when more than one binding term precedes the verb. We therefore use the strategy summarized in (7) for recovering the surface subject in such instances. (7) When more than one binding term precedes a form of bind other than binding, choose the most specific of these binding terms as the surface subject of the predication.</Paragraph>
      <Paragraph position="7"> "Most specific" is determined (recursively) for a series of binding terms in the same way that the most specific part of a complex binding term is determined.</Paragraph>
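A minimal sketch of the Specificity Rule and of the subject-selection strategy in (7) follows. The dictionary layout and function names are hypothetical. Two behaviours are assumptions not stated in the text: when no simple term qualifies, the sketch falls back to the first term of the macro-noun phrase (its syntactic head), and when no preceding binding term qualifies, it falls back to the term closest to the verb (the default strategy).

```python
# Illustrative sketch only. Data layout and fallbacks are assumptions.
SPECIFIC_TYPES = {
    "Amino Acid, Peptide, or Protein",
    "Nucleic Acid, Nucleoside, or Nucleotide",
    "Nucleotide Sequence",
    "Immunologic Factor",
    "Gene or Genome",
}


def is_specific(term):
    """True if the term meets characteristic (a) or (b) of the Specificity Rule."""
    by_shape = term.get("evidence") == "Morphology Shape Rule"
    by_type = bool(SPECIFIC_TYPES.intersection(term.get("semantic_types", ())))
    return by_shape or by_type


def most_specific(simple_terms):
    """Choose the first qualifying simple term of a macro-noun phrase."""
    for term in simple_terms:
        if is_specific(term):
            return term
    return simple_terms[0]  # assumption: fall back to the syntactic head


def surface_subject(preceding_macro_phrases):
    """Strategy (7): apply the same rule across the terms preceding the verb."""
    candidates = [most_specific(phrase) for phrase in preceding_macro_phrases]
    for term in candidates:
        if is_specific(term):
            return term
    return candidates[-1]  # assumption: default to the term closest to the verb


if __name__ == "__main__":
    # Example (5); only the type of the TNF-alpha promoter comes from the text.
    example5 = [
        {"text": "transcriptionally active kappaB motifs"},
        {"text": "TNF-alpha promoter", "semantic_types": ["Nucleotide Sequence"]},
        {"text": "normal cells", "semantic_types": ["Cell"]},
    ]
    print(most_specific(example5)["text"])
```

Applying the rule first within each macro-noun phrase and then across the resulting candidates is one reading of the recursive determination mentioned above.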
      <Paragraph position="8"> The input text (8) provides an example of a binding predication cued by binds in which the arguments appear (immediately) on either side of the cuing verb. The two macro-noun phrases serving as potential arguments are underlined.</Paragraph>
      <Paragraph position="9">  (8) A transcription factor, Auxin Response Factor 1, that binds to the sequence TGTCTC in auxin response elements was cloned from Arabidopsis by using a yeast one-hybrid system. (9) &lt;auxin response factor 1&gt; BINDS &lt;sequence tgtctc&gt;</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="191" end_page="192" type="metho">
    <Paragraph position="0"> In the extracted binding relationship in (9), the Specificity Rule chooses Auxin Response Factor 1 from the first macro-noun phrase because it maps to the UMLS Metathesaurus with semantic type 'Amino Acid, Peptide, or Protein'. In the second argument, the sequence TGTCTC has a head that submits to the Morphology Shape Rule and hence is considered to be more specific than auxin response elements.</Paragraph>
    <Paragraph position="1"> In (10), the Specificity Rule applies correctly to select the surface subject of the binding predication when multiple binding terms appear to the left of the verb.</Paragraph>
    <Paragraph position="2"> (10) Phosphatidylinositol transfer protein has a single lipid-binding site that can reversibly bind phosphatidylinositol and phosphatidylcholine and transfer these lipids between membrane compartments in vitro.</Paragraph>
    <Paragraph position="3">  Both Phosphatidylinositol transfer protein and a single lipid-binding site occur to the left of bind and have been identified as binding terms by the first phase of processing. However, Phosphatidylinositol transfer protein maps to the corresponding Metathesaurus concept with semantic type 'Amino Acid, Peptide, or Protein', thus causing it to be more specific than a single lipid-binding site. The second predication listed in (10) was correctly extracted due to coordination processing.</Paragraph>
    <Paragraph position="4"> ARBITER pursues limited coordination identification in the spirit of Agarwal and Boggess (1992) and Rindflesch (1995). Only binding terms are considered as candidates for coordination. For each conjunction encountered, the phrase immediately to the right is examined; if it is a binding term, all contiguous binding terms occurring immediately to the left of the conjunction are considered to be coordinate with the right conjunct. Coordination inside the simple noun phrase is not considered, and therefore structures such as The TCR alpha beta or -gamma delta chains are not recognized. Nonetheless, as indicated in (11), this limited approach to noun phrase coordination is often effective.</Paragraph>
    <Paragraph position="5"> (11) Purified recombinant NC 1, like authentic NC 1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.</Paragraph>
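The coordination heuristic described above can be sketched on a simplified phrase sequence. The (kind, text) representation, the kind labels, and the function name are hypothetical; treating conjunctions as separate items and ignoring punctuation between phrases are simplifying assumptions rather than a description of ARBITER's internal data structures.

```python
# Illustrative sketch only. Phrases are (kind, text) pairs, where kind is
# "term" (binding term), "conj" (conjunction), or "other".
def coordinate_groups(items):
    """Return one list of coordinated binding terms per qualifying conjunction."""
    groups = []
    for i, (kind, _) in enumerate(items):
        if kind != "conj":
            continue
        right = i + 1
        # The phrase immediately to the right must be a binding term.
        if right >= len(items) or items[right][0] != "term":
            continue
        group = [items[right][1]]
        # Collect all contiguous binding terms immediately to the left.
        left = i - 1
        while left >= 0 and items[left][0] == "term":
            group.insert(0, items[left][1])
            left -= 1
        if len(group) > 1:
            groups.append(group)
    return groups


if __name__ == "__main__":
    # The tail of sentence (11), hand-segmented for illustration.
    sentence11 = [
        ("other", "bound specifically"),
        ("term", "fibronectin"),
        ("term", "collagen type I"),
        ("conj", "and"),
        ("term", "a laminin 5/6 complex"),
    ]
    print(coordinate_groups(sentence11))
```

Because only whole phrases are examined, coordination inside a simple noun phrase is invisible to this procedure, which matches the limitation noted above.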
    <Paragraph position="6">  Although the particular underspecified syntactic analysis used in the identification of binding predications in the biomedical research literature is limited in several important ways, it appears adequate to support this project at a useful level of effectiveness, a conclusion supported by evaluation.</Paragraph>
  </Section>
</Paper>