<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3312"> <Title>Postnominal Prepositional Phrase Attachment in Proteomics</Title> <Section position="4" start_page="82" end_page="84" type="metho"> <SectionTitle> 3 Approach </SectionTitle> <Paragraph position="0"> For this exploratory study we compiled two manually annotated corpora: a smaller, targeted development corpus consisting of sentences referring to enzymes in five articles, and a larger test corpus consisting of the full text of nine articles drawn from a wider set of topics. This bias in the data was set deliberately to test whether NPs referring to enzymes follow a distinct pattern. Our results suggest that the compiled heuristics are in fact not specific to enzymes, but work with comparable performance for a much wider set of NPs.</Paragraph> <Paragraph position="1"> As our goal is semantic interpretation of NPs, only postnominal PPs were considered. A large number of these follow a very simple attachment principle: right association.</Paragraph> <Paragraph position="2"> Right association (Kimball, 1973), or late closure, describes a preference for parses that result in the parse tree with the most right branches. Simply stated, right association assumes that new constituents are part of the closest possible constituent that is under construction. In the case of postnominal PPs, right association attaches each PP to the NP that immediately precedes it. An example where this strategy does fairly well is given below.</Paragraph> <Paragraph position="3"> The effect of hydrolysis of the hemicelluloses in the milled wood lignin on the molecular mass distribution was then examined. . .</Paragraph> <Paragraph position="4"> Notice that, except for the last PP, attachment to the preceding NP is correct. The last PP, on the molecular mass distribution, modifies the head NP effect. Another frequent pattern in our corpus is given below with a corresponding text fragment. 
In this pattern, the entire NP consists of one reaction fully described by several PPs that all attach to a nominalization in the head NP. Attachment according to this pattern is in direct opposition to right association.</Paragraph> <Paragraph position="5"> . . . the release of reducing sugars from carboxymethylcellulose by cellulase at 37 °C, pH 4.8. . .</Paragraph> <Paragraph position="6"> In general, the attachment behavior of a large percentage of PPs in the examined literature can be characterized by either right association or attachment to a nominalization. The preposition of a PP seems to be the main criterion for determining which attachment principle to apply. A few prepositions were observed to follow right association almost exclusively, while others show a strong affinity toward nominalizations, defaulting to right association only when no nominalization is available.</Paragraph> <Paragraph position="7"> These observations were implemented as attachment heuristics for the most frequently occurring PPs, as distinguished by their prepositions (see Table 1 for frequency data). These rules, as outlined below, account for 90% of all postnominal PPs in the corpus. The remaining 10%, for which no clear pattern could be found, are attached using right association. [Table caption fragment: heuristics and the baseline (right association) on development and test set.]</Paragraph> <Paragraph position="8"> Right Association (of, from, for): PPs headed by of, from, and for attach almost exclusively according to right association. In particular, no violation of right association by of PPs has been found. The system, therefore, attaches any PP from this class to the NP immediately preceding it.</Paragraph> <Paragraph position="9"> Strong Nominalization Affinity (by, at): In contrast, by and at PPs attach almost exclusively to nominalizations. Only rarely have they been observed to attach to non-nominalization NPs. 
In most cases where no nominalizations are present in the NP, a PP of this class actually attaches to a preceding VP. Typical nominalization and VP attachments found in the corpus are exemplified in the following two sentences.</Paragraph> <Paragraph position="10"> . . . the formation of stalk cells by culB[?] pkaR[?] cells decreased about threefold. . .</Paragraph> <Paragraph position="11"> . . . xylooligosaccharides were not detected in hydrolytic products from corn cell walls by TLC analysis.</Paragraph> <Paragraph position="12"> This attachment preference is implemented in the system as the heuristic for strong nominalization affinity. Given a PP from this class, the system first attempts attachment to the closest nominalization to the left. If no such NP is found, the PP is assumed to attach to a VP.</Paragraph> <Paragraph position="13"> Weak Nominalization Affinity (in, with, as): In, with, and as PPs show similar affinity toward nominalizations. In fact, initially, these PPs were attached with the strong affinity heuristic. However, after further observation it became apparent that these PPs do often attach to non-nominalization NPs. A typical example for each of these possibilities is given as follows.</Paragraph> <Paragraph position="14"> . . . incubation of the substrate pullulan with protein fractions.</Paragraph> <Paragraph position="15"> The major form of beta-amylase in Arabidopsis. . .</Paragraph> <Paragraph position="16"> Here, the system first attempts nominalization attachment. If no nominalizations are present in the NP, instead of defaulting to VP attachment, the PP is attached to the closest NP to its left that is not the object of an of PP. 
This behavior is intuitively consistent since in PPs are usually adjuncts to the main NP (which is usually an entity if not a nominalization) and are unlikely to modify any of the NP's modifiers.</Paragraph> <Paragraph position="17"> Effect on: The final heuristic encodes the frequent attachment of on PPs to NPs indicating effect, influence, impact, etc. While this relationship seems intuitive and likely to occur in varied texts, it may be disproportionately frequent in proteomics texts. Nonetheless, the heuristic does have a strong basis in the examined literature. An example is provided below.</Paragraph> <Paragraph position="18"> . . . the effects of reduced β-amylase activity on seed formation and germination. . .</Paragraph> <Paragraph position="19"> The system checks NPs preceding an on PP for the closest occurrence of an effect NP. If no such NPs are found, right association is used.</Paragraph> </Section> <Section position="5" start_page="84" end_page="84" type="metho"> <SectionTitle> 4 System Overview </SectionTitle> <Paragraph position="0"> There are three main phases of processing that must occur before the PP attachment heuristics can be applied. These include preprocessing and two stages of NP chunking. Upon completion of these three phases, the PP attachment module is executed.</Paragraph> <Paragraph position="1"> The preprocessing phase consists of standard tokenization and part-of-speech tagging, as well as named entity recognition (and other term lookup) using gazetteer lists and simple transducers. Recognition is currently limited to enzymes, organisms, chemicals, (enzymological) activities, and measurements. A comprehensive enzyme list including synonyms was compiled from BRENDA, and some limited organism lists, including common abbreviations, were augmented based on organisms found in the development corpus. 
For recognition of substrates and products, some of the chemical entity lists from BioRAT (Corney et al., 2004) are used.</Paragraph> <Paragraph position="2"> Activity lists from BioRAT, with several enzyme-specific additions, are also used.</Paragraph> <Paragraph position="3"> The next phase of processing uses a chunker reported by Bergler et al. (2003) and further developed for a related project. NP chunking is performed in two stages, using two separate context-free grammars and an Earley-type chart parser. No domain-specific information is used in either of the grammars; recognized entities and terms are used only for improved tokenization. The first stage chunks base NPs, without attachments. Here, the parser input is segmented into smaller sentence fragments to reduce ambiguity and processing time. The fragments are delimited by verbs, prepositions, and sentence boundaries, since none of these can occur within a base NP. In the second chunking stage, entire sentences are parsed to extract NPs containing conjunctions and PP attachments. At this stage, no attempt is made to determine the proper attachment structure of the PPs or to exclude postnominal PPs that should actually be attached to a preceding VP; any PP that follows an NP has the potential to attach somewhere in the NP.</Paragraph> <Paragraph position="4"> The final phase of processing is performed by the PP attachment module. Here, each postnominal PP is examined and attached according to the rule for its preposition. Only base NPs within the same NP are considered as possible attachment points. For the strong nominalization affinity heuristic, if no nominalization is found, the PP is assumed to attach to the closest preceding VP. 
For both nominalization affinity heuristics, the UMLS SPECIALIST Lexicon is used to determine whether the head noun of each possible attachment point is a nominalization.</Paragraph> </Section> <Section position="6" start_page="84" end_page="87" type="metho"> <SectionTitle> 5 Results &amp; Analysis </SectionTitle> <Paragraph position="0"> The development corpus was compiled from five articles retrieved from PubMed Central (PMC). The articles were the top-ranked results returned from five separate queries using BioKI:Enzymes, a literature navigation tool (Bergler et al., 2006). Sentences containing enzymes were extracted and the remaining sentences were discarded. In total, 476 sentences yielding 830 postnominal PPs were manually annotated as the development corpus.</Paragraph> <Paragraph position="1"> Attachment accuracy on the development corpus is 88%. The accuracy and coverage of each rule is summarized in Table 2 and discussed in the following sections. Also, as a reference point for performance comparison, the system was tested using only the right association heuristic, resulting in a baseline accuracy of 80%. The system performance is contrasted with the baseline and summarized for each preposition in Table 1. To measure heuristic performance, the PP attachment heuristics were scored on manual NP and PP annotations. Thus all reported accuracy numbers reflect performance of the heuristics alone, isolated from possible chunking errors. The PP attachment module is, however, designed for input from the chunker and does not handle constructs which the chunker does not provide (e.g. PP conjunctions and non-simple parenthetical NPs).</Paragraph> <Section position="1" start_page="85" end_page="85" type="sub_section"> <SectionTitle> 5.1 Right Association </SectionTitle> <Paragraph position="0"> The application of right association for PPs headed by of, for, and from resulted in correct attachment in 96.2% of their occurrences in the development corpus. 
Because this class of PPs is processed using the baseline heuristic without any refinements, it has no effect on overall system accuracy as compared to overall baseline accuracy. However, it does provide a clear delineation of the subset of PPs for which right association is a sufficient and optimal solution for attachment. Given the coverage of this class of PPs (62.8% of the corpus), it also provides an explanation for the relatively high baseline performance.</Paragraph> <Paragraph position="1"> Of PPs are attached with 99% accuracy.</Paragraph> <Paragraph position="2"> All errors involve attachment of PP conjunctions, such as . . . a search of the literature and of the GenBank database. . . , or attachment to NPs containing non-simple parenthetical statements, such as The synergy degree (the activities of XynA and cellulase cellulosome mixtures divided by the corresponding theoretical activities) of cellulase. . . . Sentences of these forms are not accounted for in the NP chunker, around which the PP attachment system was designed. Both scenarios reflect shortcomings in the NP grammars, not in the heuristic.</Paragraph> <Paragraph position="3"> For and from PPs are attached with 81% and 87% accuracy, respectively. Most of the errors here correspond to PPs that should be attached to a VP. For example, attachment errors occurred both in the sentence . . . this was followed by exoglucanases liberating cellobiose from these nicks. . . and in the sentence . . . 
the reactions were stopped by placing the microtubes in boiling water for 2 to 3 min.</Paragraph> </Section> <Section position="2" start_page="85" end_page="86" type="sub_section"> <SectionTitle> 5.2 Strong Nominalization Affinity </SectionTitle> <Paragraph position="0"> The heuristic for strong nominalization affinity deals with only two types of PPs, those headed by the prepositions by and at, both of which occur with relatively low frequency in the development corpus.</Paragraph> <Paragraph position="1"> Accordingly, the heuristic's impact on the overall accuracy of the system is rather small. However, it affords the largest increase in accuracy for the PPs of its class. The heuristic correctly determines attachment with 87.5% accuracy.</Paragraph> <Paragraph position="2"> While these PPs account for a small portion of the corpus, they play a critical role in describing enzymological information. Specifically, by PPs are most often used in the description of relationships between entities, as in the NP degradation of xylan networks between cellulose microfibrils by xylanases, while at PPs often quantitatively indicate the condition under which observed behavior or experiments take place, as in the NP Incubation of the enzyme at 40 °C and pH 9.0.</Paragraph> <Paragraph position="3"> The heuristic provides a strong performance increase over the baseline, correctly attaching 95.2% of by PPs in contrast to 23.8% with the baseline. In fact, only a single error occurred in attaching by PPs in the development corpus and the sentence in question, given below, appears to be ungrammatical in all of its possible interpretations.</Paragraph> <Paragraph position="4"> The TLC pattern of liberated cellooligosaccharides by mixtures of XynA cellulosomes and cellulase cellulosomes was similar to that caused by cellulase cellulosomes alone.</Paragraph> <Paragraph position="5"> A few other errors (e.g. 
typos, omission of words, and grammatically incorrect or ambiguous constructs) were observed in the development corpus.</Paragraph> <Paragraph position="6"> The extent of such errors and the degree to which they affect the results (either negatively or positively) is unknown. However, such errors are inescapable and any automated system is susceptible to their effects.</Paragraph> <Paragraph position="7"> Although no errors in by PP attachment were found in the development corpus, aside from the given problematic sentence, one that would be processed erroneously by the system was found manually in the GENIA Treebank. It is given below to demonstrate a boundary case for this heuristic.</Paragraph> <Paragraph position="8"> . . . modulation of activity in B cells by human T-cell leukemia virus type I tax gene. . .</Paragraph> <Paragraph position="9"> Here, the system would attach the by PP to the closest nominalization activity, when in fact, the correct attachment is to the nominalization modulation.</Paragraph> <Paragraph position="10"> This error scenario is relevant to all of the PPs with nominalization affinity. A possible solution is to separate general nominalizations, such as activity and action, from more specific ones, such as modulation, and to favor the latter type whenever possible. An experiment toward this end, with emphasis on in PPs, was performed with promising results. It is discussed in the following section.</Paragraph> <Paragraph position="11"> For at PPs, 81.5% accuracy was achieved, as compared to 18.5% with the baseline. The higher degree of error with at PPs is indicative of their more varied usage, requiring more contextual information for correct attachment. 
An example of typical variation is given in the following two sentences, both of which contain at PPs that the system incorrectly attached to the nominalization activity.</Paragraph> <Paragraph position="12"> The amylase exhibited maximal activity at pH 8.7 and 55 °C in the presence of 2.5 M NaCl.</Paragraph> <Paragraph position="13"> . . . Bacillus sp. strain IMD370 produced alkaline α-amylases with maxima for activity at pH 10.0.</Paragraph> <Paragraph position="14"> While both sentences report observed conditions for maximal enzyme activity using similar language, the attachment of the at PPs differs between them. In the first sentence, the activity was exhibited at the given pH and temperature (VP attachment), but in the second sentence, the enzyme was not necessarily produced at the given pH (NP attachment); production may have occurred under different conditions from those reported for the activity maxima.</Paragraph> <Paragraph position="15"> For errors of this nature, it seems that employing semantic information about the preceding VP and possibly also the head NP would lead to more accurate attachment. There are, however, other similar errors where even the addition of such information does not immediately suggest the proper attachment.</Paragraph> </Section> <Section position="3" start_page="86" end_page="87" type="sub_section"> <SectionTitle> 5.3 Weak Nominalization Affinity </SectionTitle> <Paragraph position="0"> The weak nominalization affinity heuristic covers a large portion of the development corpus (18.2%).</Paragraph> <Paragraph position="1"> Overall system improvement over baseline attachment accuracy can be achieved through successful attachment of this class of PPs, particularly in and with PPs, which are the second and fourth most frequently used PPs in the development corpus, respectively. Unfortunately, the usage of these PPs is also perhaps the hardest to characterize. The heuristic achieves only 76.2% accuracy. 
Though noticeably better than right association alone, it is apparent that the behavior of this class of PPs cannot be entirely characterized by nominalization affinity.</Paragraph> <Paragraph position="2"> Accuracy of in PP attachment increased by 19.2% from the baseline with this heuristic. A significant source of attachment error is the problem of multiple nominalizations in the same NP. As mentioned above, splitting nominalizations into general and specific classes may solve this problem. To explore this conjecture, the most common (particularly with in PPs) general nominalization, activity, was ignored when searching for nominalization attachment points. This resulted in a 3% increase in the accuracy for in PPs with no adverse effects on any of the other PPs with nominalization affinity.</Paragraph> <Paragraph position="3"> Despite further anticipated improvements from similar changes, attachment of in PPs stands to benefit the most from additional semantic information in the form of rules that encode containment semantics (i.e. which types of things can be contained in other types of things). Possible containment rules exist for the few semantic categories that are already implemented; enzymes, for instance, can be contained in organisms, but organisms are rarely contained in anything (though organisms can be said to be contained in their species, the relationship is rarely expressed as containment). Further analysis and more semantic categories are needed to formulate more generally applicable rules.</Paragraph> <Paragraph position="4"> With and as PPs are attached with 83.8% and 66.7% accuracy, respectively. All of the errors for these PPs involve incorrect attachment to an NP when the correct attachment is to a VP. Presented below are two sentences that provide examples of the particular difficulty of resolving these errors.</Paragraph> <Paragraph position="5"> The xylanase A . . . was expressed by E. coli with a C-terminal His tag from the vector pET29b. . 
.</Paragraph> <Paragraph position="6"> The pullulanase-type activity was identified as ZPU1 and the isoamylase-type activity as SU1.</Paragraph> <Paragraph position="7"> In the first sentence, the with PP describes the method by which xylanase A was expressed; it does not restrict the organism in which the expression occurred. This distinction requires understanding the semantic relationship between C-terminal His tags, protein (or enzyme) expression, and E. coli.</Paragraph> <Paragraph position="8"> Namely, that His tags (polyhistidine-tags) are amino acid motifs used for purification of proteins, specifically proteins expressed in E. coli. Such information could only be obtained from a highly domain-specific knowledge source. In the second sentence, the verb to which the as PP attaches is omitted. Accordingly, even if the semantics of verbs were used to help determine attachment, the system would need to recognize the ellipsis for correct attachment.</Paragraph> </Section> <Section position="4" start_page="87" end_page="87" type="sub_section"> <SectionTitle> 5.4 Effect on Heuristic </SectionTitle> <Paragraph position="0"> The attachment accuracy for on PPs is 84.6% using the effect on heuristic, a noticeable improvement over the 57.7% accuracy of the baseline. The few attachment errors for on PPs were varied and revealed no regularities suggesting future improvements.</Paragraph> </Section> <Section position="5" start_page="87" end_page="87" type="sub_section"> <SectionTitle> 5.5 Unclassified PPs </SectionTitle> <Paragraph position="0"> The remaining PPs, for which no heuristics were implemented, represent 10% of the development corpus. The system attaches these PPs using right association, with an accuracy of 60.7%. Most frequent are PPs headed by between, which are attached with 68.6% accuracy. 
A significant improvement is expected from a heuristic that attaches these PPs based on observations of semantic features in the corpus.</Paragraph> <Paragraph position="1"> Namely, that most of the NPs to which between PPs attach can be categorized as binary relations (e.g.</Paragraph> <Paragraph position="2"> bond, linkage, difference, synergy). This relational feature can be expressed in the head noun or in a prenominal modifier. In fact, more than 25% of between PPs in the development corpus attach to the NP synergistic effects (or some similar alternative), where between shows affinity toward the adjective synergistic, not the head noun effects, which does not attract between PP attachment on its own.</Paragraph> </Section> </Section> <Section position="7" start_page="87" end_page="88" type="metho"> <SectionTitle> 6 Evaluation on Varied Texts </SectionTitle> <Paragraph position="0"> To assess the general applicability of the heuristics to varied texts, the system was evaluated on a test corpus of an additional nine articles from PMC (PMC query terms: metabolism, biosynthesis, proteolysis, peptidyltransferase, hexokinase, epimerase, laccase, ligase, dehydrogenase).</Paragraph> <Paragraph position="1"> The entire text, except the abstract and introduction, of each article was manually annotated, resulting in 1603 sentences with 3079 postnominal PPs.</Paragraph> <Paragraph position="2"> The system's overall attachment accuracy on this test data is 82%, comparable to that for the development enzymology data. The accuracy and coverage of each rule for the test data, as contrasted with the development set, is given in Table 2. The baseline heuristic achieved an accuracy of 77.5%. 
A comparative performance breakdown by preposition is given in Table 1.</Paragraph> <Paragraph position="3"> Overall, changes in the coverage and accuracy of the heuristics are much less pronounced than expected from the increase in size and variance of both subject matter and writing style between the development and test data. The only significant change in rule coverage is a slight increase in the number of unclassified PPs to 12.3%. These PPs are also more varied and the right-associative default heuristic is less applicable (49.5% accuracy in the test data vs.</Paragraph> <Paragraph position="4"> 60.7% in the development data). The largest contribution to this additional error stems from a doubling of the frequency of to PPs in the test corpus. Preliminary analysis of the corresponding errors suggests that these PPs would be much better suited to the strong nominalization affinity heuristic than the right association default. The error incurred over all unclassified PPs accounts for 1.4% of the accuracy difference between the development and test data. The larger number of these PPs also explains the smaller overall difference between the system and baseline performance.</Paragraph> <Paragraph position="5"> For PPs were observed to have more frequent VP attachment in the test data. In particular, for PPs with object NPs specifying a duration (or other measurement), as exemplified below, attach almost exclusively to VPs and nominalizations.</Paragraph> <Paragraph position="6"> The sample was spun in a microfuge for 10 min. . .</Paragraph> <Paragraph position="7"> This behavior is also apparent in the development data, though in much smaller numbers. 
Applying the strong nominalization affinity heuristic to these PPs resulted in an increase of for PP attachment accuracy in the test corpus to 75.8% and an overall increase in accuracy of 1.0%.</Paragraph> <Paragraph position="8"> A similar pattern was observed for at PPs, where the pattern &lt;CHEMICAL&gt; at &lt;CONCENTRATION&gt; accounts for 25.6% of all at PP attachment errors and the majority of the performance decrease for the strong nominalization affinity heuristic between the two data sets. The remainder of the performance decrease for this heuristic is attributed to gaps in the</Paragraph> </Section> <Section position="8" start_page="88" end_page="88" type="metho"> <SectionTitle> UMLS SPECIALIST Lexicon. For instance, the </SectionTitle> <Paragraph position="0"> underlined head nouns in the following examples are not marked as nominalizations in the lexicon.</Paragraph> <Paragraph position="1"> The double mutant inhibited misreading by paromomycin . . .</Paragraph> <Paragraph position="2"> . . . the formation of stalk cells by culB[?] pkaR[?] cells. . .</Paragraph> <Paragraph position="3"> In our test corpus, these errors were only apparent in by PP attachment, but can potentially affect all nominalization-based attachment.</Paragraph> <Paragraph position="4"> Aside from the cases mentioned in this section, attachment trends in the test corpus are quite similar to those observed in the development corpus. Given the diversity in the test data, both in terms of subject matter (between articles) and writing style (between sections), the results suggest the suitability of our heuristics to proteomics texts in general.</Paragraph> </Section> </Paper>
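<!--
The attachment heuristics described in Section 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names BaseNP, closest, attach, EFFECT_NOUNS and the VP marker are hypothetical, the effect word list is an illustrative guess, and the is_nominalization flag stands in for the UMLS SPECIALIST Lexicon lookup that the paper uses. The function returns the index of the base NP chosen as attachment point, or the VP marker.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union

# Preposition classes as listed in Section 3. Prepositions of, from, for and
# all unclassified prepositions fall through to right association.
STRONG_AFFINITY = {"by", "at"}
WEAK_AFFINITY = {"in", "with", "as"}
EFFECT_NOUNS = {"effect", "effects", "influence", "impact"}  # assumed word list
VP = "VP"  # hypothetical marker meaning "attach to a preceding VP"


@dataclass
class BaseNP:
    head: str
    is_nominalization: bool = False   # stand-in for a lexicon lookup
    object_of: Optional[str] = None   # preposition whose object this NP is


def closest(nps: List[BaseNP], pred: Callable[[BaseNP], bool]) -> Optional[int]:
    """Index of the rightmost base NP satisfying pred, or None."""
    for i in range(len(nps) - 1, -1, -1):
        if pred(nps[i]):
            return i
    return None


def attach(prep: str, preceding: List[BaseNP]) -> Union[int, str]:
    """Choose an attachment point for a postnominal PP headed by prep."""
    if not preceding:
        raise ValueError("a postnominal PP needs at least one preceding base NP")
    prep = prep.lower()
    last = len(preceding) - 1  # right association: immediately preceding NP
    if prep in STRONG_AFFINITY:
        i = closest(preceding, lambda np: np.is_nominalization)
        return VP if i is None else i
    if prep in WEAK_AFFINITY:
        i = closest(preceding, lambda np: np.is_nominalization)
        if i is None:
            # closest NP to the left that is not the object of an "of" PP
            i = closest(preceding, lambda np: np.object_of != "of")
        return last if i is None else i
    if prep == "on":
        i = closest(preceding, lambda np: np.head.lower() in EFFECT_NOUNS)
        return last if i is None else i
    return last
```

For the fragment "the release of reducing sugars from carboxymethylcellulose by cellulase", calling attach("by", ...) on the three preceding base NPs selects index 0 (release). For the GENIA boundary case "modulation of activity in B cells by ...", the sketch selects activity rather than modulation, reproducing the error discussed in Section 5.2.
-->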