<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3307"> <Title>Integrating Co-occurrence Statistics with Information Extraction for Robust Retrieval of Protein Interactions from Medline</Title> <Section position="3" start_page="49" end_page="50" type="intro"> <SectionTitle> 2 Sentence-level relation extraction </SectionTitle> <Paragraph position="0"> Most systems that identify relations between entities mentioned in text documents consider only pairs of entities that are mentioned in the same sentence (Ray and Craven, 2001; Zhao and Grishman, 2005; Bunescu and Mooney, 2005). To decide the existence and the type of a relationship, these systems generally use lexico-semantic clues inferred from the sentence context of the two entities. Much recent research has focused on automatically identifying biologically relevant entities and their relationships, such as protein-protein interactions or subcellular localizations. For example, the sentence &quot;TR6 specifically binds Fas ligand&quot; states an interaction between the two proteins TR6 and Fas ligand.</Paragraph> <Paragraph position="1"> One of the first systems for extracting interactions between proteins is described in (Blaschke and Valencia, 2001). There, sentences are matched deterministically against a set of manually developed patterns, where a pattern is a sequence of words or Part-of-Speech (POS) tags and two protein-name tokens.</Paragraph> <Paragraph position="2"> Between every two adjacent words is a number indicating the maximum number of words that can be skipped at that position. An example is: &quot;interaction of (3) &lt;P&gt; (3) with (3) &lt;P&gt;&quot;. This approach is generalized in (Bunescu and Mooney, 2005), where subsequences of words (or POS tags) from the sentence are used as implicit features. 
Their weights are learned by training a customized subsequence kernel on a dataset of Medline abstracts annotated with proteins and their interactions.</Paragraph> <Paragraph position="3"> A relation extraction system that works at the sentence level and outputs normalized confidence values for each extracted pair of entities can also be used for corpus-level relation extraction. A straightforward way to do this is to apply an aggregation operator over the confidence values inferred for all occurrences of a given pair of entities. More</Paragraph> <Paragraph position="5"> in the entire corpus C, then the confidence P(R(p</Paragraph> <Paragraph position="7"> in a particular relationship R is defined as: Out of the four operators in Table 1, we believe that the max operator is the most appropriate for aggregating confidence values at the corpus level. The question that needs to be answered is whether there is a sentence somewhere in the corpus that asserts the relationship R between entities p</Paragraph> <Paragraph position="9"> . Also, the and operator would be most appropriate for finding whether R(p</Paragraph> <Paragraph position="11"> is true in all corresponding sentences in the corpus.</Paragraph> <Paragraph position="12"> The value of the noisy-or operator (Pearl, 1986) depends too heavily on the number of occurrences; it is therefore less appropriate for a corpus where the occurrence counts vary from one entity pair to another (as confirmed in our experiments from Section 6).</Paragraph> <Paragraph position="13"> For example, if the confidence threshold is set at 0.5, and the entity pair (p</Paragraph> <Paragraph position="15"> or less, each with confidence 0.1, then R(p</Paragraph> <Paragraph position="17"> false, according to the noisy-or operator. However, if (p</Paragraph> </Section></Paper>
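<!-- Editorial sketch, not part of the original paper. The section compares aggregation operators that turn the sentence-level confidences of an entity pair into a single corpus-level confidence. The Python below is a minimal illustration under stated assumptions: noisy-or uses its standard formula, while the "and" operator is assumed here to be the product of the confidences. Table 1 is not reproduced in this extract, so the authors' exact definitions may differ, and all function names are hypothetical.

```python
from math import prod
from typing import Callable, Optional, Sequence

Operator = Callable[[Sequence[float]], float]


def agg_max(confidences: Sequence[float]) -> float:
    """Highest sentence-level confidence observed for the pair."""
    return max(confidences)


def agg_noisy_or(confidences: Sequence[float]) -> float:
    """Standard noisy-or: 1 minus the product of (1 - c) over all sentences."""
    return 1.0 - prod(1.0 - c for c in confidences)


def agg_and(confidences: Sequence[float]) -> float:
    """ASSUMPTION: 'and' is taken to be the product of the confidences."""
    return prod(confidences)


def smallest_count_exceeding(
    threshold: float,
    confidence: float,
    operator: Operator,
    limit: int = 1000,
) -> Optional[int]:
    """Smallest number of equally confident sentences whose aggregate
    exceeds the threshold, or None if no count up to `limit` does."""
    for n in range(1, limit + 1):
        if operator([confidence] * n) > threshold:
            return n
    return None


if __name__ == "__main__":
    for n in (1, 3, 6, 7, 10):
        scores = [0.1] * n
        print(n, round(agg_max(scores), 4), round(agg_noisy_or(scores), 4))
    print(smallest_count_exceeding(0.5, 0.1, agg_noisy_or))
```

Under these assumptions, max stays at 0.1 however often the pair occurs, whereas noisy-or first exceeds the 0.5 threshold once the pair occurs in seven sentences of confidence 0.1. This is the dependence on occurrence counts that the text describes.
-->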