<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1308"> <Title>IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text</Title> <Section position="2" start_page="0" end_page="54" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Genomic research in the last decade has produced a large amount of data in the form of micro-array experiments, sequence information and publications discussing the discoveries. The data generated by these experiments is highly connected; the results from sequence analysis and micro-arrays depend on functional information and signal transduction pathways cited in peer-reviewed publications for evidence. Though scientists in the field are aided by many online databases of biochemical interactions, most of these are currently curated by domain experts in a labor-intensive process. Information extraction from text has therefore been pursued actively as an attempt to extract knowledge from published material and to speed up the curation process significantly.</Paragraph> <Paragraph position="1"> In the biomedical context, the first step towards information extraction is to recognize the names of proteins (Fukuda, Tsunoda et al. 1998), genes, drugs and other molecules. The next step is to recognize interaction events between such entities (Blaschke, Andrade et al. 1999; Blaschke, Andrade et al. 1999; Hunter 2000; Thomas, Milward et al.</Paragraph> <Paragraph position="2"> 2000; Thomas, Rajah et al. 2000; Ono, Hishigaki et al. 2001; Hahn and Romacker 2002) and then, finally, to recognize the relationships between interaction events. 
However, several issues make extracting such interactions and relationships difficult (Seymore, McCallum et al. 1999): (i) the task involves free text, hence there are many ways of stating the same fact; (ii) the genre of text is not grammatically simple; (iii) the text includes a lot of technical terminology unfamiliar to existing natural language processing systems; (iv) information may need to be combined across several sentences; and (v) there are many sentences from which nothing should be extracted.</Paragraph> <Paragraph position="3"> In this paper, we present a fully automated extraction approach to identify gene and protein interactions in natural language text with the help of biomedical and linguistic ontologies. Our approach works in three main stages: 1. Complex Sentence Processor (CSP): First, complex sentences are split into simple clausal structures made up of syntactic roles.</Paragraph> <Paragraph position="4"> 2. Tagging: Then, biological entities are tagged with the help of biomedical and linguistic ontologies. 3. Interaction Extractor: Finally, complete interactions are extracted by analyzing the matching contents of syntactic roles and their linguistically significant combinations.</Paragraph> <Paragraph position="5"> The novel aspects of our system are its ability to handle complex sentence structures using the Complex Sentence Processor (CSP) and to extract multiple and nested interactions specified in a sentence using the Interaction Extractor, without requiring labor-intensive pattern engineering.</Paragraph> <Paragraph position="6"> Our approach is based on the identification of syntactic roles, such as subject, objects, verb and modifiers, using word dependencies. We have used a dependency-based English grammar parser, the Link Grammar (Sleator and Temperley 1993), to identify the roles. Syntactic roles are utilized to transform complex sentences into multiple clauses, each containing a single event. 
This clausal structure enables us to engineer an automated algorithm for the extraction of events, thus overcoming the burden of labor-intensive pattern engineering for complex and compound sentences. A pronoun resolution module assists the Interaction Extractor in identifying interactions spread across multiple sentences using pronominal references. We performed comparative experimental evaluations with two state-of-the-art systems. Our experimental results show that the IntEx system presented here achieves better performance without the labor-intensive rule engineering step that these state-of-the-art systems require.</Paragraph> <Paragraph position="7"> The rest of the paper is organized as follows. In Section 2 we survey the related work. In Section 3 we present an architectural overview of the IntEx system. Sections 4 and 5 explain and illustrate the individual modules of the IntEx system. A detailed evaluation of our system against BioRAT (Corney, Buxton et al. 2004) and GeneWays (Rzhetsky, Iossifov et al. 2004) is presented in Section 6. Section 7 concludes the paper.</Paragraph> </Section> </Paper>