<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0711">
  <Title>BioAR: Anaphora Resolution for Relating Protein Names to Proteome Database Entries</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental results
</SectionTitle>
    <Paragraph position="0"> We have developed BioAR with a training corpus consisting of 7,570 biological interactions that are extracted by BioIE from 1,505 MEDLINE abstracts on yeast (cf. Kim and Park (2004)). BioAR takes 24 seconds to process 1,645 biological interactions in the training corpus. We have constructed a test corpus which is extracted from MEDLINE with a different MeSH term, namely topoisomerase inhibitors.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SOURCE
PMID 10022855
</SectionTitle>
    <Paragraph position="0"> Sentence Gadd45 could potentially mediate this effect by destabilizing histone-DNA interactions since it was found to interact directly with the four core histones.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INTERACTION
</SectionTitle>
    <Paragraph position="0"> The test corpus includes 120 unseen biological interactions extracted by BioIE. Table 15 shows the experimental results of the modules of BioAR on the test corpus (see footnote 12). Table 14 shows an example result of BioAR.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> We have analyzed the errors from each module of BioAR. All the incorrect antecedents of pronouns in the test corpus produced by the pronoun resolution module are due to incorrect named entity recognition, as in the incorrectly identified named entity DNA double-strand from the phrase DNA double-strand break (DSB) and -II in topo-I or -II. This problem can be dealt with by a domain-specific POS tagger and a named entity recognizer. Further semantic analysis with the help of the context is needed to deal with the errors of the noun phrase resolution module. For example, these triterpenoids in Table 16 are inhibitors, and thus the phrase can be a candidate antecedent of the anaphoric DNP the inhibitors.</Paragraph>
    <Paragraph position="1"> Moreover, the rules and patterns utilized by BioAR show low coverage in the test corpus. It would be helpful to utilize a machine-learning method to construct such rules and patterns from the training corpus, though there are few available anaphora-tagged corpora.</Paragraph>
    <Paragraph position="2"> Footnote 12: While the missing arguments of biological interactions often occur in the training corpus, there was only one missing argument in the test corpus, which is correctly restored by BioAR. This result is included in those of noun phrase resolution.</Paragraph>
    <Paragraph position="3"> (10) These triterpenoids were not only mammalian DNA polymerase inhibitors but also inhibitors of DNA topoisomerases I and II even though the enzymic characteristics of DNA polymerases and DNA topoisomerases, including their modes of action, amino acid sequences and three-dimensional structures, differed markedly. ... Because the three-dimensional structures of fomitellic acids were shown by computer simulation to be very similar to that of ursolic acid, the DNA-binding sites of both enzymes, which compete for the inhibitors, might be very similar. (PMID:10970789)</Paragraph>
    <Paragraph position="4"> Table 16: Incorrect resolution example of pronoun resolution module</Paragraph>
    <Paragraph position="5"> In the process of protein name grounding, BioAR incorrectly grounds 15 protein-referring phrases, 8 of which are abbreviations that it grounds with irrelevant Swiss-Prot entries. Furthermore, among the 32 protein-referring phrases not grounded by BioAR, 14 are the bare string topoisomerase, which always indicates DNA topoisomerase in the corpus of topoisomerase inhibitors. To address this problem, we need domain-specific knowledge, which we leave as future work.</Paragraph>
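    <Paragraph> The kind of domain-specific knowledge called for here can be illustrated with a minimal sketch, which is not part of BioAR. It assumes a hand-written alias table for the topoisomerase inhibitor corpus and a generic lookup callable standing in for a database query. The names DOMAIN_ALIASES, expand_with_domain_knowledge and ground, and the identifier ENTRY_A, are hypothetical.

```python
# Hypothetical sketch: all names and the alias table are illustrative
# assumptions, not part of BioAR or of any Swiss-Prot interface.

# Domain-specific aliases for a corpus on topoisomerase inhibitors, where the
# bare string "topoisomerase" is assumed to always mean "DNA topoisomerase".
DOMAIN_ALIASES = {
    "topoisomerase": "DNA topoisomerase",
}


def expand_with_domain_knowledge(phrase, aliases=DOMAIN_ALIASES):
    """Return the domain-specific expansion of a phrase, or the phrase itself."""
    key = phrase.strip().lower()
    return aliases.get(key, phrase.strip())


def ground(phrase, lookup, aliases=DOMAIN_ALIASES):
    """Try the phrase as written first, then its domain-specific expansion.

    `lookup` is any callable mapping a name to a list of database entry
    identifiers (a stand-in for a real database query).
    """
    entries = lookup(phrase.strip())
    if entries:
        return entries
    return lookup(expand_with_domain_knowledge(phrase, aliases))


if __name__ == "__main__":
    # Toy database with a placeholder identifier, not a real accession number.
    toy_db = {"DNA topoisomerase": ["ENTRY_A"]}
    print(ground("topoisomerase", lambda name: toy_db.get(name, [])))
```

Trying the phrase as written before applying the alias keeps the domain knowledge as a fallback, so phrases that already ground on their own are unaffected.</Paragraph>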
    <Paragraph position="6"> Castano et al. (2002) presented a knowledge-poor method to utilize salience measures, including parts-of-speech, positions of the candidate antecedents, agreements and lexical features. While the method reportedly shows a relatively high performance of 77% precision and 71% recall, we note that the method is unable to deal with domain-specific anaphora resolution, for example the task of identifying the proteins which contain the protein domains referred to by anaphoric expressions.</Paragraph>
    <Paragraph position="7"> Leidner et al. (2003) presented the method of grounding spatial named entities by utilizing two minimality heuristics, that is, that of assuming one referent per discourse and that of selecting the smallest bounding region in geographical maps.</Paragraph>
    <Paragraph position="8"> Hachey et al. (2004) presented a method for grounding gene names with respect to gene database identifiers by dealing with various kinds of term variations and by removing incorrect candidate identifiers with statistical methods and heuristics. These methods are similar to BioAR in that they also aim to ground the phrases in texts with respect to the entities in the real world. However, BioAR further contributes to biomedical named entity grounding by dealing with the relationships between proteins and their domains and by identifying the species information of protein names from the context.</Paragraph>
  </Section>
</Paper>