<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0711"> <Title>BioAR: Anaphora Resolution for Relating Protein Names to Proteome Database Entries</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Methods </SectionTitle> <Paragraph position="0"> BioAR identifies the antecedents of anaphoric expressions that appear in the results of BioIE and annotates the protein-referring phrases with Swiss-Prot entries. The system first locates pronouns, noun phrases with determiners (DNPs), and biological interactions as candidate anaphoric expressions. Table 3 shows the statistics of these anaphoric expressions.4 The rest of the system is implemented in the following four steps: 1) pronoun resolution, 2) resolution of anaphoric DNPs, 3) restoration of missing arguments in the biological interactions, and 4) grounding the protein-referring phrases with Swiss-Prot entries.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Pronoun resolution </SectionTitle> <Paragraph position="0"> We adopt the centering theory of Grosz et al. (1995) for the anaphora resolution of pronouns. In particular, we follow the observation that entities which have already been mentioned and are more central than others tend to be referred back to by subsequent pronouns. For example, a candidate antecedent in the sentential subject is preferred to one in the sentential object (cf. Table 4).</Paragraph> <Paragraph position="1"> As for possessive pronouns such as its and their, we have found that their antecedents are mostly located in the same or preceding sentences and that possessive pronouns can be classified into the following two types according to the sentential locations of their antecedents, [Footnote 3] There are 232 noun phrases which can be associated with Swiss-Prot entries, among 1,645 noun phrases in 516 biological interactions extracted by BioIE from a subset of yeast corpus. 
[Footnote 4] We have counted the anaphoric expressions among 1,645 noun phrases in the subset of yeast corpus.</Paragraph> <Paragraph position="2"> (4) Finally, SpNAC can bind to X-junctions that are already bound by a tetramer of the Escherichia coli RuvA protein, indicating that it interacts with only one face of the junc- where 1) for the first type, the antecedent is the protein name nearest to the left of the possessive pronoun in the same sentence, and 2) for the second type, the antecedent is the left-most protein name in the subject phrase of the same or the preceding sentence (cf. Table 5). We have also found that the local context of a possessive pronoun of the second type mostly shows syntactic parallelism with that of its antecedent, as in the two 'they' clauses of the second example in Table 5, while that of the first type does not show parallelism, since the antecedents of such possessive pronouns are mostly the protein names nearest to the left of the possessive pronouns.5 Since the antecedents of possessive pronouns of the second type can be detected with patterns that encode the parallelism between the local context of a possessive pronoun and that of its antecedent in the same sentence (cf. Table 6),6 we have set the protein names nearest to the left of the possessive pronouns in the same sentences as the default antecedents of possessive pronouns, and we have utilized patterns such as those in Table 6 to recognize the possessive pronouns of the second type and to locate their antecedents.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Noun phrase resolution </SectionTitle> <Paragraph position="0"> In the process of resolving anaphoric noun phrases, BioAR first locates the noun phrases with determiners (DNPs), especially those with definites (i.e. the) and demonstratives (i.e. 
this, these, and those), as [Footnote 5] Among the 1,000 biological interactions, there are 31 possessive pronouns of the first type and 17 possessive pronouns of the second type.</Paragraph> <Paragraph position="1"> [Footnote 6] POSS indicates a possessive pronoun; ANT indicates its antecedent; NP which follows POSS indicates the rest of the noun phrase which starts with POSS; and BeV indicates a be-verb. VB, VBN, and PP are POS tags, indicating main verbs, past participles, and prepositions, respectively. 'A | B' indicates that either A or B should occur. '...' can be matched to any sequence of words.</Paragraph> <Paragraph position="2"> (5) Using the Yeast Two-Hybrid system and further in vitro and in vivo studies, we identified the regulatory beta-subunit of casein kinase II (CKII), which specifically binds to the cytoplasmic domain of CD163 and its isoforms. (PMID:11298324) (6) F-box proteins are the substrate-recognition components of SCF (Skp1-Cullin-F-box protein) ubiquitin-protein ligases. They bind the SCF constant catalytic core by means of the F-box motif interacting with Skp1, and they bind substrates through their variable protein-protein interaction domains.</Paragraph> <Paragraph position="3"> (PMID:11099048) Table 5: Possessive pronoun resolution examples 1. via | through | due to POSS NP 2. ANT BeV VBN ... and VBN PP POSS NP 3. ANT BeV VBN and POSS NP VBN PP 4. ANT BeV VBN ... and POSS NP BeV VBN 5. VB that ANT VB ... , | and that POSS NP 6. ANT VB ... , and POSS NP VB 7. ANT's NP VB ... and POSS NP VB the candidates of anaphoric noun phrases.7 Among the noun phrases with definites, the noun phrases that do not have antecedents in the context, i.e. 
non-anaphoric DNPs, mostly belong to the classes in Table 7.8 9 The system filters out the non-anaphoric DNPs belonging to the classes in Table 7 by utilizing a list of cellular component names, a list of species names, and the patterns in Table 7 which represent the internal structures of some non-anaphoric DNPs. We have also developed modules to identify appositions and acronyms in order to filter out the remaining non-anaphoric DNPs.</Paragraph> <Paragraph position="4"> BioAR scores each candidate antecedent of an [Footnote 7] We also deal with other anaphoric noun phrases with 'both' or 'either', as in 'both proteins' and 'either protein'. [Footnote 8] GENE, PROTEIN, and DOMAIN indicate a gene name, a protein name, and a generic term indicating protein domain such as domain and subunit, respectively. DEFINITE indicates the definite article the.</Paragraph> <Paragraph position="5"> logical interactions.</Paragraph> <Paragraph position="6"> 1. (39) DNP modified by a prepositional phrase or a relative clause (Ex. the C-terminal of AF9) 2. (24) DNP of the pattern 'DEFINITE GENE protein' (Ex. the E6 protein) 3. (16) DNP with appositive structure (Ex.</Paragraph> <Paragraph position="7"> the yeast transcriptional activator Gcn4) 4. (10) DNP ending with acronyms (Ex. the retinoid X receptor (RXR)) 5. (6) DNP of the pattern 'DEFINITE PROTEIN DOMAIN' (Ex. the DNA-PK catalytic subunit) 6. (4) DNP indicating a cellular component (Ex. the nucleus) 7. (2) DNP indicating a species name (Ex.</Paragraph> <Paragraph position="8"> the yeast Saccharomyces cerevisiae) anaphoric DNP with various salience measures and identifies the candidate antecedent with the highest score as the antecedent of the anaphoric DNP (cf. Castano et al. (2002)). For example, the system assigns penalties to the candidate antecedents whose numbers do not agree with those of anaphoric DNPs. 
Among the candidate antecedents of anaphoric DNPs, the candidate antecedents in the sentential subjects are preferred to those in the sentential objects or other noun phrases, following the centering theory (Grosz et al., 1995). We have also adopted salience measures to score each candidate antecedent according to the morphological, syntactic, and semantic characteristics of candidate antecedents (cf. Castano et al. (2002)). For example, when a DNP refers to a protein, its candidate antecedents which refer to protein domains get negative scores, and when a DNP refers to a protein domain, its candidate antecedents which refer to protein domains get positive scores. Furthermore, when a DNP refers to an enzyme, its candidate antecedents which end with '-ase' get positive scores.</Paragraph> <Paragraph position="9"> In the process of resolving the anaphoric DNPs referring to protein domains, the system identifies the proteins which contain the domains referred to by the anaphoric expressions. We have constructed several syntactic patterns which describe the relationships between proteins and their domains as exemplified in Table 8. 1. DOMAIN of | in PROTEIN 2. PROTEIN BeV NN composed of DOMAIN 3. PROTEIN BeV NN comprising DOMAIN 4. PROTEIN contain DOMAIN 5. the PROTEIN DOMAIN</Paragraph> <Paragraph position="10"> The system locates the coordinate noun phrases with conjunction items such as 'and', 'or', and 'as well as' as the candidate antecedents of plural anaphoric expressions. The system also locates the proteins in the same protein family in the same document, as in MEK1 and MEK2, as the candidate antecedent of a plural anaphoric expression such as these MEKs (PMID:11134045).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Biological interaction resolution </SectionTitle> <Paragraph position="0"> BioAR also restores some of the missing arguments of interaction keywords by utilizing the context. 
When one or more syntactic arguments of biological interactions in the results of BioIE are elided, it is essential to identify the antecedents of the omitted arguments of the interactions, or null anaphora, as well. We have focused on resolving the missing arguments of nominal interaction keywords, such as interaction, association, binding, and co-immunoprecipitate,10 based on the observation that those keywords mostly represent protein-protein interactions, and thus their omitted arguments refer to proteins or protein domains in the previous context.</Paragraph> <Paragraph position="1"> In case only one argument of an interaction keyword is elided as in the first example in Table 2, the proteins in the sentential subjects are preferred as antecedents to those in other noun phrases of the sentences which contain the interaction keyword. In case both arguments of an interaction keyword are elided as in the second example in Table 2, both the sentences, whose main verbs are in the verbal form [Footnote 10] The interaction keywords of interest, interaction, association, binding, and co-immunoprecipitate, indicate physical binding between two proteins, and thus they can be replaced with one another. In addition to them, the interaction keywords phosphorylation and translocation also often indicate protein-protein interactions.</Paragraph> <Paragraph position="2"> 1. interaction of A with B 2. association of A with B 3. co-immunoprecipitation of A with B 4. binding of A to B 5. interaction between | among A and B 6. association between | among A and B 7. co-immunoprecipitation between | among A and B 8. binding between | among A and B (7) Interactions among the three MADS domain proteins were confirmed by in vitro experiments using GST-fused OsMADS1 expressed in Escherichia coli and in vitro translated proteins of OsMADS14 and -15. ... While the K domain was essential for protein-protein interaction, a region preceded by the K domain augmented this interaction. 
(PMID:11197326) Table 10: An example antecedent of a nominal interaction keyword of the interaction keyword, and the noun phrases of the patterns in Table 9, whose headwords are the same as the interaction keyword, can be the candidate antecedents of the interaction keyword with its two missing arguments. Table 10 shows an example antecedent with a nominal interaction keyword.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Protein name grounding </SectionTitle> <Paragraph position="0"> We have constructed around 0.7 million gene and protein names from the gene name (GN) and description (DE) fields of Swiss-Prot in order to recognize protein names in the literature. We have also developed several patterns to deal with the variations of protein names (cf. Table 11). Table 12 shows several examples of grounding protein names with Swiss-Prot entries.11 Taking into account the fact that many Swiss-Prot entries actually indicate certain domains of bigger proteins, for example Casein kinase II beta chain (KC2B_YEAST) and Ribonuclease P protein [Footnote 11] The terms of the form A_B, where B indicates the species information, are Swiss-Prot entries.</Paragraph> <Paragraph position="1"> component (RPM2_YEAST), BioAR grounds the phrases in the results of BioIE, which refer to protein domains, with the descriptions of Swiss-Prot entries, by converting those phrases into the structures as utilized by Swiss-Prot. For example, the phrase the regulatory beta-subunit of casein kinase II (CKII) can be grounded with KC2B_YEAST, and the phrase the individual protein subunits of eukaryotic RNase P with RPM2_YEAST. Furthermore, the information about the domains of a protein is sometimes described in the SUBUNIT field of Swiss-Prot. 
For example, the protein domain name the RNA subunit of RNase P can be grounded with RPM1 in the SUBUNIT field of RPM2_YEAST, i.e.</Paragraph> <Paragraph position="2"> Consists of a RNA moiety (RPM1) and the protein component (RPM2). Both are necessary for full enzymatic activity. We leave the problem of looking up the SUBUNIT field of Swiss-Prot as future work.</Paragraph> <Paragraph position="3"> Since a protein name can be grounded with multiple Swiss-Prot entries as shown in Table 12, BioAR tries to choose only one Swiss-Prot entry, the most appropriate one for the protein name among the candidate entries, by identifying the species of the protein from the context (cf. Hachey et al. (2004)).</Paragraph> <Paragraph position="4"> For example, while the protein name Rpg1p/Tif32p can be grounded with two Swiss-Prot entries, namely {IF3A_SCHPO, IF3A_YEAST}, the noun phrase Saccharomyces cerevisiae Rpg1p/Tif32p should be grounded only with IF3A_YEAST. Similarly, the system grounds the protein name Sla2p (8) The yeast two-hybrid system was used to screen for proteins that interact in vivo with Saccharomyces cerevisiae Rpg1p/Tif32p, the large subunit of the translation initiation factor 3 core complex (eIF3). Eight positive clones encoding portions of the SLA2/END4/MOP2 gene were isolated.</Paragraph> <Paragraph position="5"> Subsequent deletion analysis of Sla2p showed that amino acids 318-373 were essential for the two-hybrid protein-protein interaction. 
(PMID:11302750) Table 13: An annotation example for the necessity of species information only with SLA2_YEAST among candidate Swiss-Prot entries, namely {SLA2_HUMAN, SLA2_MOUSE, SLA2_YEAST}, when the protein name occurs together with the species name Saccharomyces cerevisiae in the same abstract as in Table 13.</Paragraph> <Paragraph position="6"> In summary, BioAR first locates anaphoric noun phrases, such as pronouns and anaphoric DNPs, and interaction keywords that appear in the results of BioIE, while it filters out non-anaphoric DNPs and the interaction keywords with two explicit syntactic arguments. The system identifies the antecedents of pronouns by utilizing patterns for parallelism and by following the observation in the centering theory. The system identifies the antecedents of anaphoric DNPs by utilizing various salience measures. In particular, the system identifies the proteins which contain the protein domains referred to by anaphoric expressions. The system restores the missing arguments of biological interactions from the context. Finally, the system grounds the protein-referring phrases in the results of BioIE with the most appropriate Swiss-Prot entry or entries.</Paragraph> </Section> </Section> </Paper>
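<!--
The species-based disambiguation described in Section 2.4 can be sketched roughly as follows. This is an illustrative sketch only and not BioAR's actual implementation: the names CANDIDATES, SPECIES_CODES, and ground are hypothetical, the lookup tables hold only the entries quoted in the text plus two assumed species mappings, and the plain substring match for species names is an assumption.

```python
from typing import Dict, List

# Hypothetical lookup table for illustration: protein name to candidate
# Swiss-Prot entry names. Only the entries quoted in Section 2.4 are listed.
CANDIDATES: Dict[str, List[str]] = {
    "Rpg1p/Tif32p": ["IF3A_SCHPO", "IF3A_YEAST"],
    "Sla2p": ["SLA2_HUMAN", "SLA2_MOUSE", "SLA2_YEAST"],
}

# Assumed table: species mention to the species suffix of an entry name.
SPECIES_CODES: Dict[str, str] = {
    "Saccharomyces cerevisiae": "YEAST",
    "Homo sapiens": "HUMAN",
    "Mus musculus": "MOUSE",
}


def ground(protein_name: str, abstract: str) -> List[str]:
    """Return candidate entries, narrowed by species named in the abstract."""
    candidates = CANDIDATES.get(protein_name, [])
    # Collect the species codes whose names literally occur in the abstract.
    mentioned = {code for name, code in SPECIES_CODES.items() if name in abstract}
    # Keep only candidates whose species suffix matches a mentioned species.
    narrowed = [e for e in candidates if e.rsplit("_", 1)[-1] in mentioned]
    # Fall back to all candidates when the context gives no species evidence.
    return narrowed or candidates
```

For the abstract in Table 13, ground("Sla2p", abstract) would keep only SLA2_YEAST, because Saccharomyces cerevisiae is named in the same abstract; without any species mention, all three candidates are returned.
-->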