<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3307">
  <Title>Integrating Co-occurrence Statistics with Information Extraction for Robust Retrieval of Protein Interactions from Medline</Title>
  <Section position="3" start_page="49" end_page="50" type="intro">
    <SectionTitle>
2 Sentence-level relation extraction
</SectionTitle>
    <Paragraph position="0"> Most systems that identify relations between entities mentioned in text documents consider only pairs of entities that are mentioned in the same sentence (Ray and Craven, 2001; Zhao and Grishman, 2005; Bunescu and Mooney, 2005). To decide the existence and the type of a relationship, these systems generally use lexico-semantic clues inferred from the sentence context of the two entities. Much recent research has focused on automatically identifying biologically relevant entities and their relationships, such as protein-protein interactions or subcellular localizations. For example, the sentence &quot;TR6 specifically binds Fas ligand&quot; states an interaction between the two proteins TR6 and Fas ligand.</Paragraph>
    <Paragraph position="1"> One of the first systems for extracting interactions between proteins is described in (Blaschke and Valencia, 2001). There, sentences are matched deterministically against a set of manually developed patterns, where a pattern is a sequence of words or Part-of-Speech (POS) tags and two protein-name tokens.</Paragraph>
    <Paragraph position="2"> Between every two adjacent words is a number indicating the maximum number of words that can be skipped at that position. An example is: &quot;interaction of (3) &lt;P&gt; (3) with (3) &lt;P&gt;&quot;. This approach is generalized in (Bunescu and Mooney, 2005), where subsequences of words (or POS tags) from the sentence are used as implicit features. Their weights are learned by training a customized subsequence kernel on a dataset of Medline abstracts annotated with proteins and their interactions.</Paragraph>
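The deterministic matching of such skip-bounded patterns can be sketched as follows. This is a minimal illustration, not the implementation of Blaschke and Valencia (2001): the names matches, matches_at and PATTERN are hypothetical, the string "PROTEIN" is an assumed placeholder for a protein-name token, and the gap between "interaction" and "of" is assumed to allow no skipped words because the example shows no number there.

```python
from typing import List, Tuple

# Each pattern element is (token, max_skip): max_skip is the largest number of
# words that may be skipped between the previous pattern token and this one.
# "PROTEIN" is a hypothetical placeholder for a protein-name token.
Pattern = List[Tuple[str, int]]


def matches_at(tokens: List[str], pattern: Pattern, pos: int, k: int) -> bool:
    """True if pattern[k:] matches with pattern[k] placed exactly at tokens[pos]."""
    if pos >= len(tokens) or tokens[pos] != pattern[k][0]:
        return False
    if k == len(pattern) - 1:
        return True
    max_skip = pattern[k + 1][1]
    # Try every allowed number of skipped words before the next pattern token.
    return any(
        matches_at(tokens, pattern, pos + 1 + skip, k + 1)
        for skip in range(max_skip + 1)
    )


def matches(tokens: List[str], pattern: Pattern) -> bool:
    """True if the pattern matches starting at any position in the sentence."""
    return any(matches_at(tokens, pattern, start, 0) for start in range(len(tokens)))


PATTERN: Pattern = [
    ("interaction", 0),  # first token; its skip value is never consulted
    ("of", 0),           # assumed: no skip where the example shows no number
    ("PROTEIN", 3),
    ("with", 3),
    ("PROTEIN", 3),
]

sentence = "the interaction of the kinase PROTEIN directly with PROTEIN was observed"
print(matches(sentence.split(), PATTERN))
```

The recursion tries every permitted skip length at each position, so a sentence is rejected only when no alignment satisfies all of the skip bounds.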
    <Paragraph position="3"> A relation extraction system that works at the sentence level and outputs normalized confidence values for each extracted pair of entities can also be used for corpus-level relation extraction. A straightforward way to do this is to apply an aggregation operator over the confidence values inferred for all occurrences of a given pair of entities. More</Paragraph>
    <Paragraph position="5"> in the entire corpus C, then the confidence P(R(p1, p2))</Paragraph>
    <Paragraph position="7"> in a particular relationship R is defined as: [...] Out of the four operators in Table 1, we believe that the max operator is the most appropriate for aggregating confidence values at the corpus level. The question that needs to be answered is whether there is a sentence somewhere in the corpus that asserts the relationship R between entities p1 and p2.</Paragraph>
    <Paragraph position="9"> Also, the and operator would be most appropriate for finding whether R(p1, p2)</Paragraph>
    <Paragraph position="11"> is true in all corresponding sentences in the corpus.</Paragraph>
    <Paragraph position="12"> The value of the noisy-or operator (Pearl, 1986) depends too strongly on the number of occurrences; it is therefore less appropriate for a corpus where the occurrence counts vary from one entity pair to another (as confirmed by our experiments in Section 6).</Paragraph>
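The aggregation step described above can be sketched as follows. Table 1 is not reproduced in this section, so the formulas below are the usual textbook definitions and should be read as assumptions: noisy-or as one minus the product of the complements, and the and operator as the product of the confidences. Only the three operators named in the text are covered, and all function names are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence


def agg_max(confs: Sequence[float]) -> float:
    """Confidence of the single most confident sentence."""
    return max(confs)


def agg_noisy_or(confs: Sequence[float]) -> float:
    """Assumed definition: 1 - prod(1 - c); grows with the number of sentences."""
    return 1.0 - math.prod(1.0 - c for c in confs)


def agg_and(confs: Sequence[float]) -> float:
    """Assumed definition: prod(c); high only if every sentence is confident."""
    return math.prod(confs)


OPERATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "max": agg_max,
    "noisy-or": agg_noisy_or,
    "and": agg_and,
}


def corpus_confidence(sentence_confs: Sequence[float], operator: str = "max") -> float:
    """Aggregate the sentence-level confidences of one entity pair."""
    if not sentence_confs:
        raise ValueError("the entity pair must occur in at least one sentence")
    return OPERATORS[operator](sentence_confs)


# Hypothetical confidences for one entity pair seen in three sentences.
confs = [0.9, 0.2, 0.1]
for name in OPERATORS:
    print(name, round(corpus_confidence(confs, name), 3))
```

Under these assumed definitions, max is driven by the single strongest sentence, while the and operator drops quickly as soon as any sentence is uncertain.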
    <Paragraph position="13"> For example, if the confidence threshold is set at 0.5, and the entity pair (p1, p2)</Paragraph>
    <Paragraph position="15"> or less, each with confidence 0.1, then R(p1, p2)</Paragraph>
    <Paragraph position="17"> is false, according to the noisy-or operator. However, if (p1, p2)</Paragraph>
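The dependence of noisy-or on the occurrence count can be checked with a short calculation. The sketch below takes a threshold of 0.5 and a per-sentence confidence of 0.1, assumes the standard noisy-or definition 1 - (1 - c)^n for n occurrences with equal confidence c, and uses hypothetical function names.

```python
def noisy_or_repeated(conf: float, n: int) -> float:
    """Noisy-or of n occurrences that all carry the same confidence."""
    return 1.0 - (1.0 - conf) ** n


def first_count_above(conf: float, threshold: float) -> int:
    """Smallest occurrence count whose noisy-or value exceeds the threshold."""
    if 0.0 >= conf or conf > 1.0 or threshold >= 1.0:
        raise ValueError("need 0 below conf, conf at most 1, and threshold below 1")
    n = 1
    while not noisy_or_repeated(conf, n) > threshold:
        n += 1
    return n


# Noisy-or value as the same weak evidence (0.1) is repeated n times.
for n in range(1, 9):
    print(n, round(noisy_or_repeated(0.1, n), 3))

print(first_count_above(0.1, 0.5))  # 7
```

With these values the noisy-or value first exceeds 0.5 at seven occurrences, whereas the max operator would stay at 0.1 however many sentences mention the pair.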
  </Section>
</Paper>