<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1304">
  <Title>Enhancing Performance of Protein Name Recognizers Using Collocation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Filtering Strategies
</SectionTitle>
    <Paragraph position="0"> For protein name recognition, rule-based and dictionary-based systems are usually complementary. Rule-based systems can recognize protein names that are not listed in a dictionary, but they also let some false entities through. Dictionary-based systems can recognize the proteins listed in a dictionary, but limited coverage is their major deficiency. In this section, we employ the protein collocates mined earlier to help identify the molecular entities. The Yapex system (Olsson et al., 2002) is adopted to propose candidates, and the collocates serve as restrictions that filter out the less likely protein names.</Paragraph>
    <Paragraph position="1"> The following filtering strategies are proposed.</Paragraph>
    <Paragraph position="2"> Assume the candidate set M0 is the output generated by Yapex.</Paragraph>
    <Paragraph position="3"> M1: For each candidate in M0, check if a collocate is found in its collocation window.</Paragraph>
    <Paragraph position="4"> If yes, tag the candidate as a protein name.</Paragraph>
    <Paragraph position="5"> Otherwise, discard it.</Paragraph>
    <Paragraph position="6"> M2: Some of the collocates may be substrings of protein names. We relax the restriction in M1 as follows. If a collocate appears in the candidate or in the collocation window of the candidate, then tag the candidate as a protein name; otherwise, discard it.</Paragraph>
    <Paragraph position="7"> M3: Some protein names may appear more than once in a document, and they may not co-occur with a collocate at every occurrence. In other words, the protein candidate and a collocate may co-occur at the first occurrence, the second occurrence, or only at the last occurrence.</Paragraph>
    <Paragraph position="8"> We revise M1 and M2 as follows to capture this phenomenon. When checking whether a collocate co-occurs with a protein candidate, a candidate without any collocate is marked as undecidable rather than rejected outright. After all the candidates have been examined, an undecidable candidate is accepted as a protein name if any of its other occurrences co-occurs with a collocate.</Paragraph>
    <Paragraph position="9"> In other words, as long as a candidate has been confirmed once, it is assumed to be a protein throughout. This yields two filtering alternatives, M31 and M32, derived from M1 and M2, respectively.</Paragraph>
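The four filters can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the token-based representation, the `Candidate` class, the symmetric window of 3 tokens, the lowercase collocate set, and all function names are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int  # index of the first token of the candidate
    end: int    # index one past its last token


def name_of(tokens, cand):
    return " ".join(tokens[cand.start:cand.end]).lower()


def confirmed(tokens, cand, collocates, window, relaxed):
    # Tokens inside the (assumed symmetric) collocation window.
    left = tokens[max(0, cand.start - window):cand.start]
    right = tokens[cand.end:cand.end + window]
    if any(tok.lower() in collocates for tok in left + right):
        return True
    # M2 relaxation: a collocate may also be a substring of the name itself.
    return relaxed and any(col in name_of(tokens, cand) for col in collocates)


def filter_local(tokens, cands, collocates, window=3, relaxed=False):
    """M1 (relaxed=False) or M2 (relaxed=True): judge each occurrence alone."""
    return [c for c in cands if confirmed(tokens, c, collocates, window, relaxed)]


def filter_deferred(tokens, cands, collocates, window=3, relaxed=False):
    """M31 or M32: a name confirmed at one occurrence is kept at all of them."""
    ok = {name_of(tokens, c) for c in cands
          if confirmed(tokens, c, collocates, window, relaxed)}
    return [c for c in cands if name_of(tokens, c) in ok]
```

With `relaxed=False` the two functions correspond to M1 and M31; with `relaxed=True` they correspond to M2 and M32. The deferred variants differ only in that the decision is made per name after all occurrences have been seen.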
    <Paragraph position="10"> To obtain a more objective evaluation, we used another corpus of 101 abstracts used by Yapex [http://www.sics.se/humle/projects/prothalt]. Using the test corpus and answer keys provided by the Yapex project, the evaluation results for the filtering strategies are listed in Table 1.</Paragraph>
    <Paragraph position="11"> Compared with the baseline model M0, the precision rates of all four models using collocates improved by more than 8%. The recall rates of M1 and M2 decreased by about 13%, so their overall F-scores decreased by about 2% compared to M0. In contrast, when the tagging decision was deferred until all the information had been considered, the recall rate decreased by only 2% and the overall F-scores of M31 and M32 increased by 4% relative to M0. The best model, M32, improved the precision rate from 70.90% to 81.94% and the F-score from 70.22% to 74.54%. This meets our expectation, i.e., to enhance the precision rate without significantly reducing the recall rate.</Paragraph>
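The relation between the reported precision, recall and F-score can be checked with a short calculation. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall); this section does not state which formula was used. The function name `f_score` is hypothetical, and the recall of 69.53% for M0 is the value quoted in Section 5.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline M0: precision 70.90%, recall 69.53%.
m0 = f_score(70.90, 69.53)
print(f"{m0:.2f}")
```

Under this assumption the computed value for M0 lies within a few hundredths of a point of the reported 70.22%.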
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Integration Strategies
</SectionTitle>
    <Paragraph position="0"> Now we consider how to improve the recall rates.</Paragraph>
    <Paragraph position="1"> Integration strategies based on a hybrid concept are introduced. The basic idea is that each protein name tagger has its own specific features, so it recognizes certain names according to its own rules or recognition methods. Among the protein names proposed by different recognizers, there may be some overlaps and some differences. In other words, a protein name recognizer may tag a protein name that another recognizer cannot identify, or both of them may accept certain common proteins. The integration strategies are used to select correct protein names proposed by multiple recognizers.</Paragraph>
    <Paragraph position="2"> In this study, we experimented with Yapex and KeX because they are freely available on the web.</Paragraph>
    <Paragraph position="3"> Because protein candidates are proposed by two named entity extractors independently, two candidates may be totally separated, totally overlapped, overlapped in between, overlapped in the beginning, or overlapped in the end. Figure 2 illustrates these five cases.</Paragraph>
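The five cases can be told apart by comparing span boundaries. The sketch below is one possible reading, not the authors' definition: spans are assumed to be half-open `(start, end)` offsets, type D is read as "same start, different end", type E as "same end, different start", and type C as any other overlap. The function name `overlap_type` is hypothetical.

```python
def overlap_type(a, b):
    """Classify two (start, end) spans into types A-E (assumed reading)."""
    a_start, a_end = a
    b_start, b_end = b
    if b_start >= a_end or a_start >= b_end:
        return "A"  # totally separated
    if (a_start, a_end) == (b_start, b_end):
        return "B"  # totally overlapped
    if a_start == b_start:
        return "D"  # overlapped in the beginning
    if a_end == b_end:
        return "E"  # overlapped in the end
    return "C"      # overlapped in between


for pair in [((0, 5), (7, 9)), ((0, 5), (0, 5)), ((0, 10), (3, 6)),
             ((0, 10), (0, 4)), ((0, 10), (6, 10))]:
    print(pair, overlap_type(*pair))
```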
    <Paragraph position="4"> The following integration strategies combine the results from the two sources.</Paragraph>
    <Paragraph position="5"> When the protein names produced by the two recognizers are totally separated (i.e., type A), retain them as protein candidates. This integration strategy postulates that one protein name recognizer may extract some proteins that the other cannot identify.</Paragraph>
    <Paragraph position="6"> When the protein names produced by the two recognizers are exactly the same (i.e., type B), retain them as protein candidates. Because both taggers accept the same protein names, there must be some special features that fit protein names.</Paragraph>
    <Paragraph position="7"> When the protein names tagged by the two taggers partially overlap (i.e., types C, D and E), two additional integration strategies are employed, i.e., Yapex-based and KeX-based strategies. In the former, we adopt the protein names tagged by Yapex as candidates and discard the ones produced by KeX. In the latter, the names tagged by KeX are kept instead. Both strategies are considered because each recognizer has its own characteristics, and we do not know in advance which one performs better.</Paragraph>
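The three rules above can be sketched as a merge of two span lists. This is an illustrative sketch only: spans are assumed to be half-open `(start, end)` tuples, and the names `overlaps`, `integrate` and the `prefer` parameter are hypothetical.

```python
def overlaps(a, b):
    """True when two half-open (start, end) spans share at least one position."""
    return a[1] > b[0] and b[1] > a[0]


def integrate(yapex, kex, prefer="yapex"):
    """Merge two candidate lists; on partial overlap keep the preferred tagger."""
    primary, secondary = (yapex, kex) if prefer == "yapex" else (kex, yapex)
    merged = set(primary)
    for span in secondary:
        # Keep a secondary span only if, against every primary span, it is
        # either identical (type B) or disjoint (type A).
        if all(span == p or not overlaps(span, p) for p in primary):
            merged.add(span)
    return sorted(merged)


yapex = [(0, 5), (10, 20), (30, 35)]
kex = [(0, 5), (12, 18), (40, 45)]
print(integrate(yapex, kex, prefer="yapex"))
print(integrate(yapex, kex, prefer="kex"))
```

Setting `prefer="yapex"` corresponds to the Yapex-based strategy and `prefer="kex"` to the KeX-based one; only the treatment of the partially overlapping pair differs between the two calls.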
    <Paragraph position="8"> [Figure 2. Candidates Proposed by Two Systems. Type A: totally separated; Type B: totally overlap; Type C: overlapped in between; Type D: overlapped in the beginning; Type E: overlapped in the end.] The above integration strategies put together all the possible protein candidates except the ambiguous cases (i.e., types C, D and E). That tends to increase the recall rate. To avoid decreasing the precision rate, we also employ the collocates mentioned in Section 3 to filter out the less likely protein candidates. Furthermore, to objectively evaluate the performance of the proposed collocates, we apply the same strategies to the same test corpus with some terms suggested by human experts. A total of 48 verbal keywords, which were used to find protein pathways, are employed; they are listed in Appendix B.</Paragraph>
    <Paragraph position="9"> Four sets of experiments were designed as follows for Yapex- and KeX-based integration strategies, respectively.</Paragraph>
    <Paragraph position="10"> (1) YA and KA: Use the collocates automatically extracted in Section 3 to filter out the candidates as described in Section 4.</Paragraph>
    <Paragraph position="11"> (2) YB and KB: Use the terms suggested by human experts for the filtering strategies.</Paragraph>
    <Paragraph position="12"> (3) YA-C and KA-C: If Yapex and KeX recommend the same protein names (i.e., type B), regard them as protein names without consideration of collocates. Otherwise, use the collocates proposed in this study for filtering. (4) YB-C and KB-C: Similar to (3) except that the collocates are replaced by the terms suggested by human experts.</Paragraph>
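The difference between the plain and the "-C" experiment sets can be sketched as a single switch. The code is a hypothetical illustration: `select`, `agreed`, `passes_filter` and `bypass_agreed` are names invented for this example, and `passes_filter` stands for whichever filter (automatically mined collocates or expert terms) is in use.

```python
def select(candidates, agreed, passes_filter, bypass_agreed):
    """Keep candidates that pass the filter; optionally accept agreed ones outright.

    candidates    -- spans after integration
    agreed        -- spans proposed identically by both taggers (type B)
    passes_filter -- callable, True when a collocate or keyword supports the span
    bypass_agreed -- False for YA/YB/KA/KB, True for the "-C" variants
    """
    kept = []
    for cand in candidates:
        if bypass_agreed and cand in agreed:
            kept.append(cand)  # accepted without consulting the filter
        elif passes_filter(cand):
            kept.append(cand)
    return kept


candidates = [(0, 5), (10, 20), (30, 35)]
agreed = {(0, 5)}
supported = {(10, 20)}  # spans that happen to have a collocate nearby
print(select(candidates, agreed, lambda c: c in supported, bypass_agreed=False))
print(select(candidates, agreed, lambda c: c in supported, bypass_agreed=True))
```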
    <Paragraph position="13"> The experimental results are listed in Tables 2 and 3. The tendency M32&gt;M31&gt;M2&gt;M1 still holds in the new experiments, which shows that delaying the decision until clear evidence is found is workable. YA, YA-C, KA, and KA-C perform better than the corresponding models (i.e., YB, YB-C, KB, and KB-C). This suggests that the set of collocates proposed by our system is more complete than the set of terms suggested by human experts.</Paragraph>
    <Paragraph position="14"> Compared with the recall rate of M0 in Table 1 (i.e., 69.53%), the recall rates of both Yapex- and KeX-based integration increased, to 77.52% and 70.60%, respectively. That matches our expectation. However, the precision rates decreased by more than the recall rates increased. In particular, the F-score of the KeX-based integration strategy is 4.70% worse than that of the baseline M0. This shows that KeX did not perform well on this test set, so it cannot recommend good candidates in the integration stage. Moreover, the F-scores of M31 and M32 for YA and YA-C are better than that of M0 in Table 1. This reveals that Yapex performed better on this test corpus, so that we can enhance the performance with both the filtering and integration strategies. Nevertheless, the models in Tables 2 and 3 still cannot compete with M32 in</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Concluding Remarks
</SectionTitle>
    <Paragraph position="0"> This paper presents a fully automatic way of mining collocates from scientific text in the protein domain and successfully employs them to improve the performance of protein name recognition. The same approach can be extended to other domains such as genes, DNA, RNA, and drugs. The collocates extracted from a domain corpus are also important keywords for pathway discovery, so a systematic path from basic named entity finding to complex relationship discovery can be established.</Paragraph>
    <Paragraph position="1"> In this paper, applying the filtering strategy alone yields better performance than applying the filtering and integration strategies together. One possible reason is that the adopted systems are similar, i.e., both systems are rule-based, and some heuristic steps used in one system are inherited from the other. The effects of combining different types of protein name taggers, e.g., rule-based and corpus-based, will be investigated in the future.</Paragraph>
  </Section>
</Paper>