<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1308">
  <Title>IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text</Title>
  <Section position="4" start_page="55" end_page="55" type="metho">
    <SectionTitle>
3 System Architecture
</SectionTitle>
    <Paragraph position="0"> English sentences are classified as simple, complex, compound or complex-compound based on the number and types of clauses they contain. Our extraction system resolves the complex, compound and complex-compound sentence structures (collectively referred to as complex sentence structures in this document) into simple sentence clauses, each of which contains a subject and a predicate. These simple sentence clauses are then processed to obtain the interactions between proteins. The architecture of the IntEx system is shown in Figure 1, and Sections 4 and 5 explain the workings of its modules.</Paragraph>
  </Section>
  <Section position="5" start_page="55" end_page="58" type="metho">
    <SectionTitle>
4 Complex Sentence Processing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="55" end_page="56" type="sub_section">
      <SectionTitle>
4.1 Pronoun Resolution
</SectionTitle>
      <Paragraph position="0"> Interactions are often specified through pronominal references to entities in the discourse, or through co-references, where a number of phrases are used to refer to the same entity. Hence, a complete approach to extracting information from text should also resolve these references. References to entities are generally categorized as co-references or anaphora and have been investigated using various approaches (Castano, Zhang et al. 2002). The IntEx anaphora resolution subsystem currently focuses on third person pronouns and reflexives, since first and second person pronouns are frequently used to refer to the authors of the papers.</Paragraph>
      <Paragraph position="1"> Our pronoun resolution module uses a heuristic approach to identify the noun phrases referred to by the pronouns in a sentence. The heuristic is based on the number of the pronoun (singular or plural) and the proximity of the noun phrase. The first noun phrase that matches the number of the pronoun is taken as the referent.</Paragraph>
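      <Paragraph position="2"> The number-and-proximity heuristic can be sketched as follows. This is an illustration only, not the IntEx implementation: the pronoun table, the tuple layout for noun phrases, and the reading of "first" as "nearest preceding" are all assumptions made for the example.
<![CDATA[
```python
from typing import List, Optional, Tuple

# Hypothetical lookup table: third person pronouns and reflexives only,
# since first and second person pronouns are left unresolved.
PRONOUN_NUMBER = {
    "it": "singular", "its": "singular", "itself": "singular",
    "they": "plural", "them": "plural", "their": "plural",
    "themselves": "plural",
}

# A candidate noun phrase: (token index where it ends, text, number).
NounPhrase = Tuple[int, str, str]


def resolve_pronoun(pronoun: str, pronoun_index: int,
                    noun_phrases: List[NounPhrase]) -> Optional[str]:
    """Return the nearest preceding noun phrase that agrees in number."""
    number = PRONOUN_NUMBER.get(pronoun.lower())
    if number is None:
        return None  # not a pronoun this sketch handles
    # Assumption: only noun phrases before the pronoun are candidates.
    preceding = [np for np in noun_phrases if np[0] < pronoun_index]
    # Walk the candidates from nearest to farthest.
    for _, text, np_number in sorted(preceding, key=lambda np: np[0],
                                     reverse=True):
        if np_number == number:
            return text
    return None
```
]]>
Given the candidates "The SAC6 gene" (singular) and "actin filaments" (plural), a later "it" resolves to the former and "they" to the latter, while "we" is left unresolved.</Paragraph>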
    </Section>
    <Section position="2" start_page="56" end_page="56" type="sub_section">
      <SectionTitle>
4.2 Entity Tagger
</SectionTitle>
      <Paragraph position="0"> The entity tagging module marks the names of genes and proteins in text. Tagging combines dictionary lookup with heuristics. Regular expressions are also used to mark names that have no match in the dictionaries.</Paragraph>
      <Paragraph position="1"> The protein name dictionaries for the entity tagger are derived from various biological sources such as [...] [Figure 3 caption fragment: ... with 'The SAC6 gene'. c) Each row represents a simple sentence; d) for each constituent, the role type is resolved and interaction words are tagged; e) the protein-protein interaction is extracted.]</Paragraph>
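      <Paragraph position="2"> A minimal sketch of dictionary lookup with a regular-expression fallback is given below. The toy dictionary and the fallback pattern (two or more capital letters followed by digits) are assumptions chosen for illustration; the actual dictionaries and expressions used by IntEx are not reproduced here.
<![CDATA[
```python
import re
from typing import List, Set, Tuple

# Hypothetical toy dictionary standing in for the real protein dictionaries.
PROTEIN_DICTIONARY: Set[str] = {"actin", "pRb", "c-Abl", "ku70"}

# Assumed fallback pattern: 2+ uppercase letters followed by digits ("SAC6").
NAME_PATTERN = re.compile(r"\b[A-Z]{2,}[0-9]+\b")


def tag_entities(sentence: str) -> List[Tuple[str, str]]:
    """Return (name, source) pairs; source is 'dictionary' or 'regex'."""
    tagged: List[Tuple[str, str]] = []
    seen: Set[str] = set()
    # Pass 1: exact dictionary lookup on each token.
    for token in re.findall(r"[A-Za-z0-9-]+", sentence):
        if token in PROTEIN_DICTIONARY and token not in seen:
            tagged.append((token, "dictionary"))
            seen.add(token)
    # Pass 2: regular expression for names the dictionary missed.
    for match in NAME_PATTERN.finditer(sentence):
        name = match.group(0)
        if name not in seen:
            tagged.append((name, "regex"))
            seen.add(name)
    return tagged
```
]]>
For the sentence "The SAC6 gene colocalizes with actin", this sketch tags "actin" from the dictionary and "SAC6" through the fallback pattern.</Paragraph>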
    </Section>
    <Section position="3" start_page="56" end_page="56" type="sub_section">
      <SectionTitle>
4.3 Preprocessor
</SectionTitle>
      <Paragraph position="0"> The tagged sentences need to be pre-processed to replace syntactic constructs, such as parenthesized nouns and domain-specific terminology, that cause the Link Grammar Parser to produce incorrect output. Such elements are replaced with alternative forms that the parser can recognize.</Paragraph>
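      <Paragraph position="1"> As one possible illustration of this step, the sketch below cuts parenthesized spans out of a sentence and records them so they can be restored later. The source does not specify the replacement format, so the behaviour here (removal plus an offset map) and the function name are assumptions.
<![CDATA[
```python
import re
from typing import Dict, Tuple

# A parenthesized span with no nested parentheses, plus leading whitespace.
PAREN = re.compile(r"\s*\(([^()]*)\)")


def preprocess(sentence: str) -> Tuple[str, Dict[int, str]]:
    """Remove parenthesized spans; map each cut offset to the removed text."""
    removed: Dict[int, str] = {}
    pieces = []
    read_pos = 0   # position in the input sentence
    out_len = 0    # length of the cleaned sentence built so far
    for match in PAREN.finditer(sentence):
        chunk = sentence[read_pos:match.start()]
        pieces.append(chunk)
        out_len += len(chunk)
        removed[out_len] = match.group(1)  # offset in the cleaned sentence
        read_pos = match.end()
    pieces.append(sentence[read_pos:])
    return "".join(pieces), removed
```
]]>
Keeping the removed text keyed by offset lets a later stage re-attach an alias such as "pRb" to the noun it followed.</Paragraph>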
    </Section>
    <Section position="4" start_page="56" end_page="57" type="sub_section">
      <SectionTitle>
4.4 Link Grammar and the Link Grammar Parser
</SectionTitle>
      <Paragraph position="0"> Link grammar (LG), introduced by Sleator and Temperley (Sleator and Temperley 1991), is a dependency-based grammatical system. The basic idea of link grammar is to connect pairs of words in a sentence with various syntactically significant links. The LG consists of a set of words, each of which has various alternative linking requirements. A linking requirement can be seen as a block with connectors above each word. A connector is satisfied by matching it with a compatible connector. Figure 2 shows how linking requirements can be satisfied to produce a parse for the example sentence &quot;The dog chased a cat&quot;.</Paragraph>
      <Paragraph position="1"> Even though LG has no explicit notion of constituents or categories (Sleator and Temperley 1993), they emerge as contiguous connected sequences of words attached to the rest of the sentence by particular types of links, as in the above example, where 'the dog' and 'a cat' are connected to the main verb via 'S' and 'O' links respectively. Our algorithms exploit this property of LG: certain link types allow us to extract the constituents of sentences irrespective of tense. The LG parser's ability to detect multiple verbs and their constituent linkage in complex sentences makes it particularly well suited to resolving complex sentences into their multiple clauses. The LG parser's dictionary can also be easily enhanced to produce better parses for biomedical text (Szolovits 2003).</Paragraph>
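      <Paragraph position="2"> The way constituents emerge from 'S' and 'O' links can be sketched on the example sentence. The linkage below is written by hand for illustration and is not actual parser output; the tuple representation and function names are assumptions, and real linkages carry more link types and subscripts than shown here.
<![CDATA[
```python
from typing import Dict, List, Set, Tuple

Link = Tuple[str, int, int]  # (label, left word index, right word index)

WORDS = ["The", "dog", "chased", "a", "cat"]
# Hand-written linkage: determiner (D), subject (S) and object (O) links.
LINKS: List[Link] = [("D", 0, 1), ("S", 1, 2), ("O", 2, 4), ("D", 3, 4)]


def constituent(head: int, verb: int, links: List[Link]) -> List[int]:
    """Collect the words connected to `head` without passing through `verb`."""
    seen: Set[int] = {head}
    stack = [head]
    while stack:
        current = stack.pop()
        for _, left, right in links:
            if current in (left, right):
                other = right if current == left else left
                if other != verb and other not in seen:
                    seen.add(other)
                    stack.append(other)
    return sorted(seen)


def subject_verb_object(words: List[str], links: List[Link]) -> Dict[str, str]:
    """Read subject, verb and object off the S and O links."""
    roles: Dict[str, str] = {}
    for label, left, right in links:
        if label.startswith("S"):    # simplification: noun left, verb right
            roles["verb"] = words[right]
            roles["subject"] = " ".join(
                words[i] for i in constituent(left, right, links))
        elif label.startswith("O"):  # verb left, noun right
            roles["object"] = " ".join(
                words[i] for i in constituent(right, left, links))
    return roles
```
]]>
On this linkage the sketch recovers 'The dog' as subject, 'chased' as verb and 'a cat' as object, without consulting the tense of the verb.</Paragraph>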
    </Section>
    <Section position="5" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
4.5 Complex Sentence Processor Algorithm
</SectionTitle>
      <Paragraph position="0"> The complex sentence processor (CSP) component splits complex sentences into a collection of simple sentence clauses, each of which contains a subject and a predicate. The CSP follows a verb-based approach to extract the simple clauses. A sentence is considered complex if it contains more than one verb. A simple sentence is one with a subject, a verb, objects and their modifying phrases. The example in Figure 3 illustrates the major steps involved in complex sentence processing. The following schema is used as the format to represent simple clauses:</Paragraph>
      <Paragraph position="1"> Interactions are extracted from the simple sentence clauses produced by the complex sentence processor. The highly technical terminology and the complex grammatical constructs present in biomedical abstracts make the extraction task difficult. Even a simple sentence with a single verb can contain multiple and/or nested interactions. For this reason, our IE system is based on the deep parse structure produced by the LG parser and performs a thorough case-based analysis of the contents of the syntactic roles of the sentences, namely their subjects (S), verbs (V), objects (O) and modifying phrases (M), as well as their linguistically significant and meaningful combinations such as S-V-O, S-O, S-V-M or S-M, illustrated in Figure 4, for finding and extracting interactions.</Paragraph>
    </Section>
    <Section position="6" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
5.1 Role Type Matcher
</SectionTitle>
      <Paragraph position="0"> For each syntactic constituent of the sentence, the role type matcher identifies the type of each role as either 'Elementary', 'Partial' or 'Complete' based on its matching content, as presented in Table 1.</Paragraph>
    </Section>
    <Section position="7" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
Role Type Description
</SectionTitle>
      <Paragraph position="0"> Elementary: If the role contains a protein name or an interaction word.</Paragraph>
      <Paragraph position="1"> Partial: If the role has a protein name and an interaction word.</Paragraph>
      <Paragraph position="2"> Complete: If the role has at least two protein names and an interaction word.</Paragraph>
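      <Paragraph position="3"> The three role types can be expressed as a small classifier. This is a sketch under stated assumptions: roles arrive as token lists, protein names and interaction words are given as plain sets, and distinct protein names are counted; none of these details is specified by the source.
<![CDATA[
```python
from typing import Iterable, Optional, Set


def role_type(tokens: Iterable[str], proteins: Set[str],
              interaction_words: Set[str]) -> Optional[str]:
    """Classify a syntactic role following the definitions in Table 1."""
    tokens = list(tokens)
    # Assumption: count distinct protein names, not repeated mentions.
    n_proteins = len({t for t in tokens if t in proteins})
    has_interaction = any(t in interaction_words for t in tokens)
    if n_proteins >= 2 and has_interaction:
        return "Complete"
    if n_proteins >= 1 and has_interaction:
        return "Partial"
    if n_proteins >= 1 or has_interaction:
        return "Elementary"
    return None  # the role contains nothing of interest
```
]]>
The checks run from the most specific type to the least specific, so a role that satisfies 'Complete' is never reported as 'Partial' or 'Elementary'.</Paragraph>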
    </Section>
    <Section position="8" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
5.2 Interaction Word Tagger
</SectionTitle>
      <Paragraph position="0"> Words that denote a biologically significant action between two gene/protein names are labeled as 'interaction words'. Our gazetteer for interaction words is derived from UMLS and WordNet.</Paragraph>
    </Section>
    <Section position="9" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
5.3 Interaction Extractor (IE)
</SectionTitle>
      <Paragraph position="0"> The IntEx interaction extractor works as follows. The input to IE is the preprocessed and typed simple clause structures. The IE algorithm progresses bottom up, starting from each syntactic role S, V or M, and expanding them using the lattice provided in Figure 4 until all 'Complete' singleton or composite role types are obtained.</Paragraph>
      <Paragraph position="1"> Consider the example shown in Figure 3. For the third sentence, the boundaries of the subject and the modifying phrase are identified and both are role typed as 'Elementary' using Table 1. Since the main verb is tagged as an interaction word, IE uses the S-V-M composite role from Figure 4 to find and extract the following complete interaction: {'The SAC 6 gene Protein', 'colocalizes', 'actin'}.</Paragraph>
      <Paragraph position="2"> 'Complete' roles also need to be analyzed in order to determine their voice as 'active' or 'passive'.</Paragraph>
      <Paragraph position="3"> Since only a small number of preposition combinations, such as of-by and from-to, occur frequently within the clauses, they can be used to distinguish the agent and the theme of the interactions. For example, in the sentence &quot;The kinase phosphorylation of pRb by c-Abl in the gland could inhibit ku70&quot;, the subject role is &quot;The kinase phosphorylation of pRb by c-Abl in the gland&quot;. Since the subject has at least two protein names and an interaction word, it is 'Complete'. By using the 'of-by' pattern (... &lt;Interaction-Word (action)&gt; ... of ... &lt;theme&gt; ... by ... &lt;agent&gt; ...), the IE is able to extract the correct interaction {c-Abl, phosphorylation, pRb} from the subject role alone.</Paragraph>
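      <Paragraph position="4"> The 'of-by' pattern from the example above can be sketched as a token scan. This is an illustrative reading of the pattern rather than the IntEx code: the tokenizer, the function name and the rule of taking the first protein name after each preposition are assumptions.
<![CDATA[
```python
import re
from typing import Optional, Set, Tuple


def extract_of_by(role: str, proteins: Set[str],
                  interaction_words: Set[str]
                  ) -> Optional[Tuple[str, str, str]]:
    """Return (agent, action, theme) for '<action> of <theme> by <agent>'."""
    tokens = re.findall(r"[A-Za-z0-9-]+", role)
    for i, token in enumerate(tokens):
        if token not in interaction_words:
            continue
        rest = tokens[i + 1:]
        if "of" not in rest:
            continue
        of_at = rest.index("of")
        if "by" not in rest[of_at + 1:]:
            continue
        by_at = rest.index("by", of_at + 1)
        # Theme: first protein between 'of' and 'by'; agent: first after 'by'.
        theme = next((t for t in rest[of_at + 1:by_at] if t in proteins), None)
        agent = next((t for t in rest[by_at + 1:] if t in proteins), None)
        if theme and agent:
            return (agent, token, theme)
    return None
```
]]>
Applied to the subject role "The kinase phosphorylation of pRb by c-Abl in the gland", the sketch returns the agent c-Abl, the action phosphorylation and the theme pRb, matching the interaction given in the text.</Paragraph>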
    </Section>
  </Section>
</Paper>