<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1026">
  <Title>Extracting Molecular Binding Relationships from Biomedical Text</Title>
  <Section position="5" start_page="193" end_page="193" type="concl">
    <SectionTitle>
3 Application
</SectionTitle>
    <Paragraph position="0"> As an initial application of ARBITER we ran the program on 491,356 MEDLINE citations, which were retrieved using the same search strategy responsible for the gold standard. During this run, 331,777 sentences in 192,997 citations produced 419,782 total binding assertions. Extrapolating from the gold standard evaluation, we assume that this is about half of the total binding predications asserted in the citations processed and that somewhat less than three quarters of those extracted are correct.</Paragraph>
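The extrapolation above can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: it treats "somewhat less than three quarters" as an upper-bound precision of 0.75 and reads "about half" as a recall of 0.5 (correct extractions divided by all asserted predications). Both round values and the function name estimate_binding_counts are assumptions made here, not figures or code from the evaluation itself.

```python
def estimate_binding_counts(extracted, precision, recall):
    """Estimate correct extractions and total asserted predications.

    precision: assumed fraction of extracted assertions that are correct
    recall:    assumed fraction of all asserted predications that were
               correctly extracted
    """
    correct = extracted * precision    # extractions expected to be right
    total_asserted = correct / recall  # predications present in the text
    return round(correct), round(total_asserted)


# 419,782 is the number of assertions reported for the MEDLINE run;
# 0.75 and 0.5 are the rounded, assumed rates described in the lead-in.
correct, total = estimate_binding_counts(419_782, precision=0.75, recall=0.5)
print(f"estimated correct extractions (upper bound): {correct:,}")
print(f"estimated predications asserted in the text: {total:,}")
```

Because the precision figure is an upper bound, the first number is best read as a ceiling on the correct extractions rather than a point estimate.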
    <Paragraph position="1"> The initial list of 419,982 binding triples represents what ARBITER determined was asserted in the text being processed. Many of these assertions, such as those in (14), while correct, are too general to be useful.</Paragraph>
    <Paragraph position="2"> Further processing of ARBITER's raw output extracted specific protein names and genomic structures and reduced the number of such binding predications to 345,706. From these more specific binding predications, we began constructing a database of binding relations asserted in the literature. A more detailed discussion of this database can be found in (Rajan et al. in prep); here we give an initial description of its characteristics.</Paragraph>
    <Paragraph position="3"> We submitted the 345,706 more specific ARBITER binding predications to a search in GenBank (Benson et al. 1998) and determined that 106,193 referred to a GenBank entry. The number of GenBank entries with at least one binding assertion is 11,617. Preliminary results indicate that the database we are constructing will have some of the following characteristics:
      * 10,769 bindings between two distinct GenBank entries (5,569 unique)
      * 875 additional binding assertions between an entry and a specific DNA sequence
      * 27,345 bindings between a GenBank entry and a UMLS Metathesaurus concept
      * 5,569 unique relationships among pairs of entries (involving 11,617 unique entries)
      Conclusion
      The cooperation of structured domain knowledge and underspecified syntactic analysis enables the extraction of macromolecular binding relationships from the research literature. Although our implementation is domain-specific, the underlying principles are amenable to broader applicability.</Paragraph>
    <Paragraph position="4"> ARBITER distinguishes two phases: first labeling binding terms, and then identifying certain of these terms as arguments in a binding predication. The first phase depends on biomedical domain knowledge accessible from the UMLS. Applying the techniques we propose in other areas would require at least a minimum of semantic classification of the concepts involved. General, automated techniques that could meet this requirement are becoming increasingly available (Morin and Jacquemin 1999, for example).</Paragraph>
    <Paragraph position="5"> Although we concentrated on the inflectional forms of a single verb, the principles we invoke to support argument identification during the second phase of processing apply generally to English predication encoding strategies (with a minimum of effort necessary to address prepositional cuing of gerundive arguments for specific verbs). The approach to noun phrase coordination also applies generally, so long as hypernymic classification is available for the heads of the potential conjuncts.</Paragraph>
  </Section>
</Paper>