<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1634"> <Title>Automatic Construction of Predicate-argument Structure Patterns for Biomedical Information Extraction</Title> <Section position="5" start_page="0" end_page="284" type="metho"> <SectionTitle> 2 Full Parsing 2.1 Necessity for Full Parsing </SectionTitle> <Paragraph position="0"> A technique that many previous approaches have used is shallow parsing (Koike et al., 2003; Yao et al., 2004; Zhou et al., 2005). Their assertion is that shallow parsers are more robust and would be sufficient for IE. However, their claims that shallow parsers are sufficient, or that full parsers do not contribute to application tasks, have not been fully proved by experimental results.
[Table 1 caption: Distance -1 means the protein word has been annotated as interacting with itself (e.g. "actin polymerization"). Distance 0 means the words of the interacting proteins are directly next to one another. Multi-word protein names are concatenated as long as they do not cross the tags that annotate proteins.]</Paragraph> <Paragraph position="1"> Zhou et al. (2005) argued that most information useful for IE derived from full parsing was shallow. However, they only used dependency trees and paths on full parse trees in their experiment. Such structures do not include information on semantic subjects/objects, which full parsing can recognize. Additionally, most relations they extracted from the ACE corpus (Linguistic Data Consortium, 2005) on broadcasts and newswires were within a very short word distance (70% where two entities are embedded in each other or separated by at most one word), and therefore shallow information was beneficial. However, Table 1 shows that the word distance between interacting protein names annotated in the AImed corpus (Bunescu and Mooney, 2004) is long, and we have to treat long-distance relations for information like protein-protein interactions.</Paragraph> <Paragraph position="2"> Full parsing is more effective than shallow parsing for acquiring generalized data over long word distances. The sentences at left in Figure 1 exemplify the advantages of full parsing.
[Figure 1, left: "ENTITY1 recognizes and activates ENTITY2." / "ENTITY2 activated by ENTITY1 are not well characterized." / "The herpesvirus encodes a functional ENTITY1 that activates human ENTITY2." / "ENTITY1 can functionally cooperate to synergistically activate ENTITY2." / "The ENTITY1 plays key roles by activating ENTITY2."]
The gerund "activating" in the last sentence takes a non-local semantic subject "ENTITY1", and shallow parsing cannot recognize this relation because "ENTITY1" and "activating" are in different phrases. Full parsing, on the other hand, can identify that the subject of the whole sentence and the semantic subject of "activating" are shared.</Paragraph> <Section position="1" start_page="284" end_page="284" type="sub_section"> <SectionTitle> 2.2 Predicate-argument Structures </SectionTitle> <Paragraph position="0"> We applied Enju (Tsujii Laboratory, 2005a) as a full parser that outputs predicate-argument structures (PASs). PASs are well-normalized forms that represent syntactic relations. Enju is based on Head-driven Phrase Structure Grammar (Sag and Wasow, 1999), and it has been trained on the Penn Treebank (PTB) (Marcus et al., 1994) and a biomedical corpus, the GENIA Treebank (GTB) (Tsujii Laboratory, 2005b). We used a part-of-speech (POS) tagger trained on the GENIA corpus (Tsujii Laboratory, 2005b) as a preprocessor for Enju. On predicate-argument relations, Enju achieved 88.0% precision and 87.2% recall on the PTB, and 87.1% precision and 85.4% recall on the GTB.</Paragraph> <Paragraph position="1"> The illustration at right in Figure 1 is a PAS example, representing the relation between "activate", "ENTITY1" and "ENTITY2" that all the sentences at left share. The predicate and its arguments are words converted to their base forms, augmented by their POSs. The arrows denote the connections from predicates to their arguments, and the types of the arguments are indicated as arrow labels, i.e., ARGn (n = 1, 2, ...) and MOD. For example, the semantic subject of a transitive verb is ARG1 and the semantic object is ARG2.</Paragraph>
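To make the PAS representation concrete, here is a minimal sketch of how such structures might be encoded; the class and field names are our own illustration, not Enju's actual output format.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PAS:
    """One predicate-argument structure: a base-form word with its POS and
    labeled links (ARG1, ARG2, ..., MOD) from the predicate to its arguments."""
    word: str                                        # base form, e.g. "activate"
    pos: str                                         # POS tag, e.g. "VB"
    args: Dict[str, "PAS"] = field(default_factory=dict)

# "ENTITY1 activates ENTITY2": a transitive verb takes its semantic
# subject as ARG1 and its semantic object as ARG2.
entity1 = PAS("ENTITY1", "NN")
entity2 = PAS("ENTITY2", "NN")
activate = PAS("activate", "VB", {"ARG1": entity1, "ARG2": entity2})
```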
<Paragraph position="2"> What is important here is that, thanks to the strong normalization of syntactic variations, we can expect a pattern construction algorithm that works on PASs to need a much smaller training corpus than one that works on surface-word sequences. Furthermore, because the diversity of surface-word sequences is reduced at the PAS level, an IE system at this level should achieve improved recall.</Paragraph> </Section> </Section> <Section position="6" start_page="284" end_page="285" type="metho"> <SectionTitle> 3 Related Work </SectionTitle> <Paragraph position="0"> Sudo et al. (2003), Culotta and Sorensen (2004) and Bunescu and Mooney (2005) acquired substructures derived from dependency trees as extraction patterns for IE in general domains. Their approaches were similar to our approach using PASs derived from full parsing. However, one problem with their systems is that they could not treat non-local dependencies such as the semantic subjects of gerund constructions (discussed in Section 2), and thus the rules acquired from such constructions were partial.</Paragraph> <Paragraph position="1"> Bunescu and Mooney (2006) also learned extraction patterns for protein-protein interactions, using an SVM with a generalized subsequence kernel. Their patterns are sequences of words, POSs, entity types, etc., and they heuristically restricted the length and word positions of the patterns. Although they achieved about 60% precision and about 40% recall, these heuristic restrictions are not guaranteed to carry over to other IE tasks.</Paragraph> <Paragraph position="2"> Hao et al. (2005) learned extraction patterns for protein-protein interactions as sequences of words, POSs, entity tags and gaps by dynamic programming, and reduced/merged them using a minimum description length-based algorithm. Although they achieved 79.8% precision and 59.5% recall, the sentences in their test corpus contain too many positive instances, and some of the patterns they claimed to have constructed successfully go against linguistic or biomedical intuition (e.g. "ENTITY1 and interacts with ENTITY2" should have been replaced by a more general pattern, because they aimed to reduce the number of patterns).</Paragraph> </Section> <Section position="7" start_page="285" end_page="288" type="metho"> <SectionTitle> 4 Method </SectionTitle> <Paragraph position="0"> We automatically construct patterns to extract protein-protein interactions from an annotated training corpus. The corpus needs to be tagged to denote which protein words are interacting pairs.</Paragraph> <Paragraph position="1"> We follow five steps in constructing extraction patterns from the training corpus. (1) Sentences in the training corpus are parsed into PASs, and we extract raw patterns from the PASs. (2) We divide the raw patterns to generate both combination and fragmental patterns. Because the obtained patterns include inappropriate ones (wrongly generated or too general), (3) we apply both kinds of patterns to the PASs of sentences in the training corpus, (4) calculate scores for the matching results of the combination patterns, and (5) build a prediction model with an SVM using these matching results and scores.</Paragraph> <Paragraph position="2"> In the actual IE phase, we extract pairs of interacting proteins from a target text in four steps. (1) Sentences in the target corpus are parsed into PASs. (2) We apply both kinds of extraction patterns to these PASs and (3) calculate scores for combination pattern matching. (4) We use the prediction model to predict interacting pairs.
[Figure 2: the PAS analysis of "CD4 protein interacts with polymorphic region of MHCII" (CD4/NN protein/NN interact/VB with/IN polymorphic/JJ region/NN of/IN MHCII/NN), from which a raw pattern connecting the two proteins is extracted.]</Paragraph>
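The overall flow of the construction phase can be summarized in code. The following is a runnable mock-up in which every function body is a trivial stand-in for the procedures detailed in Sections 4.1 through 4.5; only the control flow, which mirrors steps (1) to (5) above, is meant to be informative.

```python
def parse_to_pas(sentence):
    # Step (1a): full parsing into PASs (the paper uses Enju; stubbed here).
    return {"sentence": sentence, "pas": []}

def extract_raw_patterns(graphs):
    # Step (1b): extract the smallest PAS sets connecting interacting proteins.
    return [g["pas"] for g in graphs]

def divide_patterns(raw_patterns):
    # Step (2): divide raw patterns into combination and fragmental patterns.
    combination = list(raw_patterns)
    fragmental = [pas for pattern in raw_patterns for pas in pattern]
    return combination, fragmental

def match_patterns(combination, fragmental, graphs):
    # Step (3): apply both kinds of patterns to the training PASs.
    return []

def score_matches(matches, alpha=0.01):
    # Step (4): Score = TP / (TP + FP) + alpha * TP (see Section 4.4).
    return []

def construct_extraction_patterns(corpus):
    graphs = [parse_to_pas(s) for s in corpus]
    combination, fragmental = divide_patterns(extract_raw_patterns(graphs))
    matches = match_patterns(combination, fragmental, graphs)
    scores = score_matches(matches)
    # Step (5): an SVM prediction model is trained on matches and scores.
    return combination, fragmental, matches, scores

print(construct_extraction_patterns(["ENTITY1 activates ENTITY2 ."]))
```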
<Section position="1" start_page="285" end_page="286" type="sub_section"> <SectionTitle> 4.1 Full Parsing and Extraction of Raw Patterns </SectionTitle> <Paragraph position="0"> As the first step in both the construction phase and the application phase of extraction patterns, we parse sentences into PASs using Enju.1 We label all PASs of protein names as protein PASs.</Paragraph> <Paragraph position="1"> After parsing, we extract the smallest set of PASs that connects the words denoting a pair of interacting proteins, and make it a raw pattern. We use the same method to extract and refine raw patterns as Yakushiji et al. (2005). Connecting means that we can trace predicate-argument relations from one protein word to the other in an interacting pair.</Paragraph> <Paragraph position="2"> The procedure to obtain a raw pattern (p0, ..., pn) is as follows, where predicate(p) denotes the PASs that have p as their argument, and argument(p) denotes the PASs that p has as its arguments:
1. pi = p0 is the PAS of a word corresponding to one of the interacting proteins, and we obtain candidates for the raw pattern as follows:
1-1. If pi is the PAS of the word of the other interacting protein, (p0, ..., pi) is a candidate for the raw pattern.
1-2. If not, make pattern candidates for each pi+1 ∈ predicate(pi) ∪ argument(pi) − {p0, ..., pi} by returning to 1-1.
2. Among the candidates, adopt the smallest set of PASs as the raw pattern.
3. Substitute variables (ENTITY1, ENTITY2) for the predicates of the PASs corresponding to the interacting proteins.</Paragraph> <Paragraph position="3"> The lower part of Figure 2 shows an example of the extraction of a raw pattern. "CD4" and "MHCII" are words representing interacting proteins. First, we set the PAS of "CD4" as p0. argument(p0) includes the PAS of "protein", and we set it as p1 (in other words, we trace arrow (1)). Next, predicate(p1) includes the PAS of "interact" (tracing arrow (2) back), so we set it as p2. We continue similarly until we reach the PAS of "MHCII" (p6). The extracted raw pattern is the set p0, ..., p6, with the variables ENTITY1 and ENTITY2 substituted for "CD4" and "MHCII".</Paragraph> <Paragraph position="4"> There are some cases where an extracted raw pattern is not appropriate and we need to refine it. One case is when unnecessary coordinations/parentheses are included in the pattern, e.g. when two interactions are described in a combined representation ("ENTITY1 binds this protein and ENTITY2"). Another is when two interacting proteins are connected directly by a conjunction, or when only one protein participates in an interaction. In such cases, we refine patterns by unfolding coordinations/parentheses and by extending patterns, respectively. We omit detailed explanations because of space limitations; the details are described in the work of Yakushiji et al. (2005).</Paragraph> </Section>
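Since a raw pattern is the smallest chain of PASs connecting the two protein words, the candidate search in steps 1-1 and 1-2 can be read as a breadth-first search over predicate-argument links. Below is a minimal sketch under that reading, assuming a `neighbors` function that plays the role of predicate(p) ∪ argument(p); the word-level toy graph only illustrates the Figure 2 example.

```python
from collections import deque

def raw_pattern(start, goal, neighbors):
    """Breadth-first search from one interacting protein's PAS to the
    other's; the first path found is the smallest connecting set."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path   # afterwards, substitute ENTITY1/ENTITY2 for the proteins
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy adjacency for "CD4 protein interacts with polymorphic region of MHCII":
links = {
    "CD4": ["protein"], "protein": ["CD4", "interact"],
    "interact": ["protein", "with"], "with": ["interact", "region"],
    "region": ["with", "polymorphic", "of"], "polymorphic": ["region"],
    "of": ["region", "MHCII"], "MHCII": ["of"],
}
print(raw_pattern("CD4", "MHCII", links.get))
# ['CD4', 'protein', 'interact', 'with', 'region', 'of', 'MHCII'] = p0 ... p6
```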
<Section position="2" start_page="286" end_page="287" type="sub_section"> <SectionTitle> 4.2 Division of Patterns </SectionTitle> <Paragraph position="0"> Division for generating combination patterns is based on the observation of Yakushiji et al. (2005) that there are many cases where combinations of verbs and certain nouns form IE patterns. In the work of Yakushiji et al. (2005), we divided only patterns that include exactly one verb. We have extended the division process to also treat nominal patterns and patterns that include more than one verb.</Paragraph> <Paragraph position="1"> Combination patterns are not appropriate for utilizing individual word information, because they are always used in rather strictly combined ways. Therefore we have newly introduced fragmental patterns, which consist of independent PASs taken from raw patterns, in order to use individual word information for higher recall.</Paragraph> <Paragraph position="2"> Raw patterns are divided into components, and the components are combined to construct combination patterns according to the type of division. There are three types of division of raw patterns for generating combination patterns:
(a) Two-entity Division
(a-1) Entity-Main-Entity Division
(a-2) Main-Entity-Entity Division
(b) Single-entity Division
(c) No Division (Naive Patterns)</Paragraph> <Paragraph position="3"> Most raw patterns, where entities are at both ends of the pattern, are divided as Entity-Main-Entity. Main-Entity-Entity division is for cases where there are PASs other than entities at the ends of the pattern (e.g. "interaction between ENTITY1 and ENTITY2"). Single-entity division is a special case of Main-Entity-Entity division for interactions with only one participant (e.g. "ENTITY1 dimerization"). Figure 3 shows an example of Entity-Main-Entity division. First, the main component of the raw pattern is its syntactic head PAS. If the raw pattern corresponds to a sentence, the syntactic head PAS is the PAS of the main verb. We underspecify the arguments of the main component to enable them to unify with the PASs of any words with the same POSs. Next, if there are PASs of prepositions connecting to the main component, they become prep components. If there is no PAS of a preposition next to the main component on the connecting link from the main component to an entity, we make the pseudo-PAS of a null preposition the prep component; the left prep component ($X) in Figure 3 is such a pseudo-PAS. We also underspecify the arguments of prep components. Finally, the two remaining parts of the raw pattern, which are typically noun phrases, become entity components. PASs corresponding to the entities of the original pair are labeled as unifiable only with the entities of other pairs.</Paragraph> <Paragraph position="4"> Main-Entity-Entity division is similar, except that we distinguish one prep component as a double-prep component, and the PAS of the coordinate conjunction between the entities becomes the coord component. Single-entity division is similar to Main-Entity-Entity division; the difference is that single-entity division produces no coord component and only one entity component. Naive patterns are patterns without division, used where no division can be applied (e.g. "ENTITY1/NN in/IN complexes/NN with/IN ENTITY2/NN").</Paragraph> <Paragraph position="5"> All PASs on the boundaries of components are labeled to determine which PASs on the boundaries of other components they can be unified with. The labels are represented by subscripts in Figure 3. These restrictions on component connection are used in the step of constructing combination patterns.</Paragraph> <Paragraph position="6"> Constructing combination patterns by combining components amounts to reconstructing original raw patterns from the original combinations of components, or constructing new raw patterns from new combinations of components. For example, an Entity-Main-Entity pattern is constructed by combining any main component, any two prep components and any two entity components. This construction process is actually executed in the pattern matching step; that is, we do not construct all possible combination patterns offline, but construct only the combination patterns that can match the target.</Paragraph> <Paragraph position="7"> A raw pattern is also split into individual PASs, and each PAS becomes a fragmental pattern. We also prepare underspecified patterns in which one or more of the arguments of the original are underspecified, i.e., able to match any words with the same POSs and the same protein/not-protein label. We underspecify the PASs of entities in fragmental patterns to enable them to unify with any PASs with the same POSs and a protein label, although in combination patterns we retain the PASs of entities as unifiable only with the entities of pairs. This is because fragmental patterns are designed to be less strict than combination patterns.</Paragraph> </Section>
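As an illustration of how fragmental patterns could be generated, the sketch below splits a raw pattern into its individual PASs and adds underspecified variants. The dict encoding of a PAS and the '*' wildcard are our own conventions, not the paper's notation.

```python
from itertools import combinations

def fragmental_patterns(raw_pattern):
    """Each PAS of a raw pattern becomes a fragmental pattern, together with
    variants in which one or more argument slots are underspecified, i.e.
    allowed to match any word with the same POS and the same
    protein/not-protein label (marked here with the wildcard '*')."""
    fragments = []
    for pas in raw_pattern:   # a PAS as a dict: {"word", "pos", "args"}
        fragments.append(pas)
        labels = list(pas["args"])
        # underspecify every non-empty subset of argument slots
        for k in range(1, len(labels) + 1):
            for subset in combinations(labels, k):
                underspec = {l: ("*" if l in subset else a)
                             for l, a in pas["args"].items()}
                fragments.append(dict(pas, args=underspec))
    return fragments

pattern = [{"word": "ENTITY1", "pos": "NN", "args": {}},
           {"word": "activate", "pos": "VB",
            "args": {"ARG1": "ENTITY1", "ARG2": "ENTITY2"}},
           {"word": "ENTITY2", "pos": "NN", "args": {}}]
print(len(fragmental_patterns(pattern)))  # 6: three originals + three variants of "activate"
```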
<Section position="3" start_page="287" end_page="287" type="sub_section"> <SectionTitle> 4.3 Pattern Matching </SectionTitle> <Paragraph position="0"> Matching of combination patterns is executed as a process that matches and combines combination pattern components according to their division types (Entity-Main-Entity, Main-Entity-Entity, Single-entity and No Division). Fragmental matching matches all fragmental patterns against the PASs derived from the sentences.</Paragraph> </Section> <Section position="4" start_page="287" end_page="288" type="sub_section"> <SectionTitle> 4.4 Scoring for Combination Matching </SectionTitle> <Paragraph position="0"> We next calculate the score of each combination matching to estimate the adequacy of the combination of components. This is necessary because new combinations of components may form inadequate patterns (e.g. "ENTITY1 be ENTITY2" can be formed of components from "ENTITY1 be ENTITY2 receptor"). Scores are derived from the results of combination matching on the source training corpus.</Paragraph> <Paragraph position="1"> We apply the combination patterns to the training corpus, and count pairs of True Positives (TP) and False Positives (FP). The scores are basically calculated by the following formula: Score = TP / (TP + FP) + α × TP. This formula is based on the precision of the pattern on the training corpus, i.e., an estimated precision on a test corpus. α works for smoothing, that is, to accept only patterns with large TP when FP = 0; α is set to 0.01 empirically. The formula is similar to the Apriori algorithm (Agrawal and Srikant, 1995), which learns association rules from a database: the first term corresponds to the confidence of that algorithm, and the second term corresponds to the support.</Paragraph>
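The scoring formula translates directly into code. This is a minimal sketch, with the α smoothing weight exposed as a parameter (the paper fixes it at 0.01); the function name is our own.

```python
def combination_score(tp, fp, alpha=0.01):
    """Score = TP / (TP + FP) + alpha * TP (Section 4.4). The first term is
    the pattern's precision on the training corpus (the 'confidence'); the
    alpha term adds 'support', so that among patterns with FP = 0 only
    those with large TP are strongly preferred."""
    # Patterns with TP = FP = 0 are scored from the estimated TP'/FP' instead.
    return tp / (tp + fp) + alpha * tp

print(combination_score(1, 0))    # 1.01
print(combination_score(20, 1))   # ~1.15: a frequent pattern outranks a lone TP
```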
<Paragraph position="2"> For patterns where TP = FP = 0, which are not matched to PASs in the training corpus (i.e., which are newly produced by combinations of components), we estimate TP' and FP' by using the confidence of the main and entity components. This is because the main and entity components tend to carry the meaning of a pattern, whereas the prep, double-prep and coord components are rather functional. The formulas to calculate the scores for all cases are built from the following counts:
* TP: number of TPs by the combination of components
* TP_main:two: sum of TPs by two-entity combinations that include the same main component
* TP_main:single: sum of TPs by single-entity combinations that include the same main component
* TP_entity_i: sum of TPs by combinations that include the same entity component, where that component is not the straight entity component
* FP_x: defined like the corresponding TP_x, with TP replaced by FP
[The score formulas themselves did not survive this extraction.]</Paragraph> <Paragraph position="3"> The entity component "ENTITY/NN", which consists only of the PAS of an entity, adds no information to combinations of components. We call this component a straight entity component and exclude its effect from the scores.</Paragraph>
[Table 2: Features for the prediction model. Combination patterns: (1) the combination of components in combination matching; (2) the main component in combination matching; (3) the entity components in combination matching; (4) the score for combination matching (SCORE). Fragmental patterns: (5) the matched fragmental patterns; (6) the number of PASs of the example that are not matched.]
</Section> <Section position="5" start_page="288" end_page="288" type="sub_section"> <SectionTitle> 4.5 Construction of Prediction Model </SectionTitle> <Paragraph position="0"> We use an SVM to learn a prediction model that determines whether a new protein pair is interacting. We used SVMlight (Joachims, 1999) with an RBF kernel, which is known to perform best for most tasks. The prediction model is based on the features in Table 2.</Paragraph> </Section> </Section> </Paper>