<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1204">
  <Title>Discovering patterns to extract protein-protein interactions from full</Title>
  <Section position="4" start_page="24" end_page="24" type="metho">
    <SectionTitle>
3 System overview
</SectionTitle>
    <Paragraph position="0"> Our system uses the framework of PathwayFinder (Yao et al., 2004). It consists of several modular components, as shown in Figure 3.</Paragraph>
    <Paragraph position="1"> The external resource required by our method is a dictionary of protein names. It contains about 60,000 entries, including many synonyms, collected from the PathwayFinder databases and from several web databases such as TrEMBL, SWISSPROT (O'Donovan et al., 2002), and SGD (Cherry et al., 1997). The training corpus contains about 1200 sentences and is described in detail in the next section. Patterns generated in the training phase are stored in the pattern database.</Paragraph>
    <Paragraph position="2"> For an input sentence, filtering rules are first applied in the pre-processing phase to remove useless expressions, for example citations such as '[1]' and list numbers such as '(1)'. Protein names in the sentence are then identified using the protein name dictionary, and the names are replaced with a unique label.</Paragraph>
    <Paragraph position="3"> Subsequently, the sentence is part-of-speech tagged with Brill's tagger (Brill et al., 1995), and the tag of each protein name is changed to PTN.</Paragraph>
    <Paragraph position="4"> Finally, the resulting sequence of tags is either added to the corpus in the training phase or passed to the matching algorithm in the testing phase.</Paragraph>
    <Paragraph position="5"> Because the pattern acquisition algorithm aligns sequences of tags, the accuracy of part-of-speech tagging is crucial. However, Brill's tagger achieved an overall accuracy of only 83% on biomedical texts, because such texts contain many unknown words. To improve the accuracy, we propose a simple and effective approach called the pre-tagging strategy, similar to the method used by Huang et al. (2004).</Paragraph>
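The pre-processing steps described above can be sketched roughly as follows. This is only an illustrative sketch, not the system's actual code: the regular expressions, the function name preprocess, the use of a single shared label PTN for every dictionary match, and the example protein names ProtA and ProtB are all assumptions made for the example.

```python
import re

# Assumed label substituted for every dictionary match (the text later uses the tag PTN).
PROTEIN_LABEL = "PTN"

# Assumed filtering rules: citations such as [1] or [2, 3], and list numbers such as (1).
CITATION_RE = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")
LIST_NUMBER_RE = re.compile(r"\(\d+\)")


def preprocess(sentence, protein_names):
    """Remove citations and list numbers, then label dictionary protein names."""
    cleaned = CITATION_RE.sub(" ", sentence)
    cleaned = LIST_NUMBER_RE.sub(" ", cleaned)
    # Replace longer names first so multi-word names win over their substrings.
    for name in sorted(protein_names, key=len, reverse=True):
        name_re = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        cleaned = name_re.sub(PROTEIN_LABEL, cleaned)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())


if __name__ == "__main__":
    names = {"ProtA", "ProtB"}  # hypothetical dictionary entries
    print(preprocess("(1) ProtA interacts with ProtB [1].", names))
```

A real dictionary of about 60,000 entries would likely call for a trie or a single compiled alternation rather than one regular expression per name; the loop here is kept simple for readability.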
  </Section>
  <Section position="5" start_page="24" end_page="25" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> Our evaluation experiments are made up of three parts: mining verbs for patterns, extracting patterns and evaluating precision and recall rates.</Paragraph>
    <Section position="1" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
4.1 Mining verbs
</SectionTitle>
      <Paragraph position="0"> The algorithm shown in Figure 1 is run on the whole corpus. In addition to the filtering rules in Table 2, one more rule is used: if the pattern has no verb tag, reject it.</Paragraph>
      <Paragraph position="1"> With this rule, only patterns that contain verbs are extracted. Here the threshold d is set to 10 to obtain highly accurate verbs for the subsequent experiments. In total, 94 verbs for describing interactions are extracted from 367 verbs. Note that different tenses of the same base form are counted as separate verbs. Sixteen of the extracted verbs are false positives that do not semantically define interactions at all, such as 'affect', 'infect' and 'localize'. Hence the accuracy is 83.0% (78 of 94). These verbs and their variants, particularly the gerund and noun forms (obtained from an English lexicon), are added to a list of filtering words called the FWL (Filtering Word List). For example, for the verb 'inhibit', the variants 'inhibition', 'inhibiting', 'inhibited' and 'inhibitor' are added to the FWL. At the current stage, we add all verbs to the FWL, including the false positives, because we think these verbs also help in understanding pathway networks between proteins.</Paragraph>
      <Paragraph position="2"> [Figure text, partially recovered. Matching algorithm: Input: a pattern set  , if not, go to step d); ii. Fill all data in mVector; iii. Determine whether to accept or reject the match according to the decision rules. If rejected, go to step d); iv. Add X ai to the result set R; 2. Output R. Decision rules: Input: two parameters P and V. 1. If cMatch [?] cLen, reject the match; 2. if cPtn &gt; P, reject the match; 3. if cVb &gt; V, reject the match.]</Paragraph>
    </Section>
    <Section position="2" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
4.2 Extracting patterns
</SectionTitle>
      <Paragraph position="0"> The pattern generating algorithm is run on the whole corpus with the FWL. The threshold d is set to 5 here.</Paragraph>
      <Paragraph position="1"> The filtering rules in Table 2, plus the following rule, are applied.</Paragraph>
      <Paragraph position="2"> If a pattern has any verb or noun that is not in FWL, reject it.</Paragraph>
      <Paragraph position="3"> This ensures that the patterns are well formed and that all their words are valid. In other words, this rule guarantees that the main verbs or nouns in every pattern describe protein interactions. The experiment runs on about 1200 sentences with threshold d=5, and 134 patterns are obtained.</Paragraph>
      <Paragraph position="4"> Some of them are listed in Figure 4.</Paragraph>
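The two additional filtering rules (reject a pattern with no verb tag; reject a pattern containing a verb or noun that is not in the FWL) can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: a pattern is assumed to be a list of (word, tag) pairs, verbs and nouns are assumed to carry Penn Treebank tags beginning with VB and NN, and the function names are hypothetical. The rules of Table 2 are not reproduced here.

```python
def is_verb(tag):
    # Assumption: Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ).
    return tag.startswith("VB")


def is_noun(tag):
    # Assumption: Penn Treebank noun tags (NN, NNS, NNP, NNPS); PTN is not a noun tag.
    return tag.startswith("NN")


def has_verb(pattern):
    """Rule from Section 4.1: a pattern with no verb tag is rejected."""
    return any(is_verb(tag) for _, tag in pattern)


def passes_fwl_rule(pattern, fwl):
    """Rule from Section 4.2: every verb or noun in the pattern must be in the FWL."""
    return all(
        word.lower() in fwl
        for word, tag in pattern
        if is_verb(tag) or is_noun(tag)
    )


if __name__ == "__main__":
    # Hypothetical FWL fragment built from the 'inhibit' example in the text.
    fwl = {"inhibit", "inhibition", "inhibiting", "inhibited", "inhibitor"}
    pattern = [("", "PTN"), ("inhibited", "VBD"), ("", "PTN")]
    print(has_verb(pattern), passes_fwl_rule(pattern, fwl))
```

Words are lower-cased before the lookup, which assumes the FWL itself is stored in lower case.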
    </Section>
    <Section position="3" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
4.3 Evaluating precision and recall rates
</SectionTitle>
      <Paragraph position="0"> In this part, three tests are performed. The first uses 383 sentences that contain only the keyword 'interact' and its variants; 293 of them are used to extract patterns and the remaining 90 are used for testing. The second uses 329 sentences that contain only the keyword 'bind' and its variants; 250 of them are used to generate patterns and the remaining 79 are used for testing. The third uses 1205 sentences covering all keywords; 1020 of them are used to generate patterns and the remaining 185 are used for testing.</Paragraph>
      <Paragraph position="1"> As described above, we do not exclude verbs such as 'affect' and 'infect'; relations between proteins defined by these verbs or nouns are therefore counted as interactions. Note that the testing and training sentences are randomly partitioned and do not overlap in any of these tests. The results are shown in Table 4. Some matching examples are shown in Figure 5. Simple sentences such as sen1-2 are matched by only one pattern. More commonly, several patterns match one sentence at different positions, as in sen3-4. In example sen5, the same pattern matches repeatedly at different positions because we used a 'multiple matches' algorithm.</Paragraph>
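The random, non-overlapping partition into training and testing sentences can be illustrated with a short sketch. The function name split_sentences and the fixed seed are assumptions made for reproducibility of the example; the source does not say how the partition was actually produced.

```python
import random


def split_sentences(sentences, n_train, seed=0):
    """Shuffle a copy of the sentences and cut it into disjoint train and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # Sentence counts taken from the three tests described in the text.
    for total, n_train in [(383, 293), (329, 250), (1205, 1020)]:
        train, test = split_sentences(range(total), n_train)
        print(total, len(train), len(test))
```

Because both parts are slices of one shuffled list, no sentence can appear in both the training and the testing set.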
    </Section>
  </Section>
  <Section position="7" start_page="25" end_page="25" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> We have proposed a new method for automatically generating patterns and extracting protein interactions. Our method outperforms previous methods in two main respects: first, it automatically mines patterns from a set of sentences whose protein names have been identified; second, it can process long and complicated sentences from full texts.</Paragraph>
    <Paragraph position="1"> In our method, a threshold d controls both the number of patterns and their generalization power. Although infrequent patterns are filtered out by a small threshold, it is instructive to examine them. For example, for the 293 sentences containing the keyword 'interact' and its variants, the patterns whose count equals one are shown in Figure 6. Among the results, some are reasonable, such as 'PTN VBZ IN PTN IN PTN '  ).</Paragraph>
    <Paragraph position="2"> These kinds of patterns are rejected because of both insufficient training data and infrequently used expressions in natural language texts. Some patterns are not accurate, such as 'NNS IN PTN PTN PTN ', because there must be a coordinating conjunction between three consecutive protein names; otherwise the pattern will cause many errors. Some patterns are even wrong, such as 'PTN NN PTN ', because there is never such a segment 'protein  ), are not precise because the last filtering rule in Table 2 is used. Nevertheless, these patterns can be filtered out by the threshold. However, how to evaluate and maintain patterns becomes a real problem. For example, when the pattern generating algorithm is applied to about 1200 sentences with a threshold d=0, approximately 800 patterns are generated, most of which appear only once in the corpus. Such a large number of patterns needs to be reduced. An MDL-based algorithm that measures the confidence of each pattern and maintains the patterns without human intervention is under development.</Paragraph>
    <Paragraph position="3"> Because our matching algorithm relies on part-of-speech tags and our patterns do not contain any adjective (JJ), interactions defined by adjectives, such as 'inducible' and 'inhibitable', cannot currently be extracted correctly by our method.</Paragraph>
    <Paragraph position="4"> This can be demonstrated by the following sentence, where words in bold are protein names. &quot;The class II proteins are expressed constitutively on B-cells and EBV-transformed B-cells, and are inducible by IFN-gamma on a wide variety of cell types.&quot; In this sentence, the interaction between class II proteins and IFN-gamma is defined by the adjective 'inducible' (tagged as JJ) and therefore does not match any pattern. To solve this problem, we are considering using word stemming and morpheme recognition to convert adjectives into their corresponding verbs in context. [Figure caption fragment: ... are separated by a semicolon. For simplicity, words in a pattern are partially listed.]</Paragraph>
    <Paragraph position="5"> By analyzing our experimental results, we find that the current matching algorithm is not optimal and causes approximately one-third of the total errors. This partially derives from the simple decision rules used in the matching algorithm. These rules may work well for some texts but partially fail for others because natural language texts are highly varied. With these considerations, a more accurate and more sophisticated matching algorithm is under development.</Paragraph>
  </Section>
</Paper>