<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1052"> <Title>Investigating a Generic Paraphrase-based Approach for Relation Extraction</Title> <Section position="4" start_page="410" end_page="410" type="metho"> <SectionTitle> 3 Assumed Configuration for RE Phenomenon Example </SectionTitle> <Paragraph position="0"> Passive form 'Y is activated by X' Apposition 'X activates its companion, Y' Conjunction 'X activates prot3 and Y' Set 'X activates two proteins, Y and Z' Relative clause 'X, which activates Y' Coordination 'X binds and activates Y' Transparent head 'X activates a fragment of Y' Co-reference 'X is a kinase, though it activates Y' strated for the normalized template 'X activate Y'. The general configuration assumed in this paper for RE is based on two main elements: a list of lexical-syntactic templates which entail the relation of interest and a syntactic matcher which identifies the template occurrences in sentences. The set of entailing templates may be collected either manually or automatically. We propose this configuration both as an algorithm for RE and as an evaluation scheme for paraphrase acquisition. The role of the syntactic matcher is to identifythedifferentsyntacticvariationsinwhichtem- null plates occur in sentences. Table 1 presents a list of generic syntactic phenomena that are known in the literature to relate to linguistic variability. A phenomenon which deserves a few words of explanation is the &quot;transparent head noun&quot; (Grishman et al., 1986; Fillmore et al., 2002). A transparent noun N1 typically occurs in constructs of the form 'N1 preposition N2' for which the syntactic relation involving N1, which is the head of the NP, applies to N2, the modifier. In the example in Table 1, 'fragment' is the transparent head noun while the relation 'activate' applies to Y as object.</Paragraph> </Section> <Section position="5" start_page="410" end_page="412" type="metho"> <SectionTitle> 4 Manual Data Analysis </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="410" end_page="410" type="sub_section"> <SectionTitle> 4.1 Protein Interaction Dataset </SectionTitle> <Paragraph position="0"> Bunescu et al. (2005) proposed a set of tasks regarding protein name and protein interaction extraction, for which they manually tagged about 200 Medline abstracts previously known to contain human protein interactions (a binary symmetric relation). Here we consider their RE task of extracting interacting protein pairs, given that the correct protein names have already been identified. All protein names are annotated in the given gold standard dataset, which includes 1147 annotated interacting protein pairs. Protein names are rather complex, and according to the annotation adoptedbyBunescuetal.(2005)canbesubstrings of other protein names (e.g., <prot> <prot> GITR </prot> ligand </prot>). In such cases, we considered only the longest names and protein pairs involving them. We also ignored all reflexive pairs, in which one protein is marked as interacting with itself. Altogether, 1052 interactions remained. 
<Paragraph position="1"> For development purposes, we randomly split the abstracts into a 60% development set (575 interactions) and a 40% test set (477 interactions).</Paragraph> </Section> <Section position="2" start_page="410" end_page="411" type="sub_section"> <SectionTitle> 4.2 Dataset analysis </SectionTitle> <Paragraph position="0"> In order to analyze the potential of our approach, two of the authors manually annotated the 575 interacting protein pairs in the development set. For each pair the annotators marked whether it can be identified using only template-based matching, assuming an ideal implementation of the configuration of Section 3. If it can, the normalized form of the template connecting the two proteins was annotated as well. The normalized template form is based on the active form of the verb, stripped of the syntactic phenomena listed in Table 1. Additionally, the relevant syntactic phenomena from Table 1 were annotated for each template instance.</Paragraph> <Paragraph position="1"> Table 2 provides several example annotations.</Paragraph> <Paragraph position="2"> A Kappa value of 0.85 (nearly perfect agreement) was measured for the agreement between the two annotators regarding whether a protein pair can be identified using the template-based method. Additionally, the annotators agreed on 96% of the normalized templates that should be used for the matching. Finally, the annotators agreed on at least 96% of the cases for each syntactic phenomenon except transparent heads, for which they agreed on 91% of the cases. This high level of agreement indicates both that template-based matching is a well defined task and that the normalized template form and its syntactic variations are well defined notions.</Paragraph>
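For reference, the agreement figure reported above is Cohen's Kappa, which can be computed from two annotators' per-pair decisions as in the sketch below; the annotation lists are hypothetical and the snippet only illustrates the measure, it is not the evaluation code used in the paper.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l]
                   for l in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical per-pair decisions ('T' = identifiable by template matching).
annotator1 = ['T', 'T', 'F', 'T', 'T', 'F', 'T', 'T']
annotator2 = ['T', 'T', 'F', 'T', 'F', 'F', 'T', 'T']
print(round(cohens_kappa(annotator1, annotator2), 2))  # 0.71 for these labels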
</Section> <Section position="3" start_page="411" end_page="412" type="sub_section"> <SectionTitle> Table 2: Example annotations (Sentence / Annotation) </SectionTitle> <Paragraph position="0"> Sentence: We have crystallized a complex between human FGF1 and a two-domain extracellular fragment of human FGFR2. Annotation: *template: 'complex between X and Y' *transparent head: 'fragment of X'</Paragraph> <Paragraph position="1"> Sentence: CD30 and its counter-receptor CD30 ligand (CD30L) are members of the TNF-receptor / TNFalpha superfamily and function to regulate lymphocyte survival and differentiation. Annotation: *template: 'X's counter-receptor Y' *apposition *co-reference</Paragraph> <Paragraph position="2"> Sentence: Cdi1, a human G1 and S phase protein phosphatase that associates with Cdk2. Annotation: *template: 'X associate with Y' *relative clause</Paragraph> <Paragraph position="3"> Several interesting statistics arise from the annotation. First, 93% of the interacting protein pairs (537/575) can be potentially identified using the template-based approach, if the relevant templates are provided. This is a very promising finding, suggesting that the template-based approach may provide most of the requested information. We term these 537 pairs template-based pairs. The remaining pairs are usually expressed by complex inference or at a discourse level. Second, for 66% of the template-based pairs at least one syntactic phenomenon was annotated. Table 4 contains the occurrence percentage of each syntactic phenomenon within the template-based pairs (537) in the development set. These results show the need for a powerful syntactic matcher on top of high-performance template acquisition, in order to correctly match a template in a sentence. Third, 175 different normalized templates were identified. For each template we counted its template instances, i.e., the number of times the template occurred, counting only occurrences that express an interaction of a protein pair. In total, we counted 341 template instances for all 175 templates. Interestingly, 50% of the template instances (184/341) are instances of the 21 most frequent templates. This shows that, though protein interaction can be expressed in many ways, writers tend to choose from among just a few common expressions. Table 3 presents the most frequent templates. Table 5 presents the minimal number of templates required to reach a range of different recall levels.</Paragraph> <Paragraph position="4"> Furthermore, we grouped template variants that are based on morphological derivations (e.g. 'X interact with Y' and 'X Y interaction') and found that 4 groups, 'X interact with Y', 'X bind to Y', 'X associate with Y' and 'X complex with Y', together with their morphological derivations, cover 45% of the template instances. This shows the need to handle generic lexical-syntactic phenomena, and particularly morphologically based variations, separately from the acquisition of normalized lexical-syntactic templates. To conclude, this analysis indicates that the template-based approach provides very high coverage for this RE dataset, and that a small number of normalized templates already provides significant recall. However, it is important to (a) develop a model for morphologically based template variations (e.g. as encoded in Nomlex (Macleod et al., )), and (b) apply accurate parsing and develop syntactic matching models to recognize the rather complex variations of template instantiations in text. Finally, we note that our particular figures are specific to this dataset and the biological abstracts domain. However, the annotation and analysis methodologies are general and are suggested as highly effective tools for further research.</Paragraph> </Section> </Section>
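The frequency figures above (341 template instances, the 21 most frequent templates covering half of them, and the recall levels of Table 5) amount to a simple counting exercise; the sketch below illustrates it on hypothetical data. The template strings and counts are invented for the example, and the functions are not the authors' analysis code.

from collections import Counter

def coverage_by_top_templates(instances, k):
    """Fraction of template instances covered by the k most frequent templates."""
    counts = Counter(instances)
    return sum(c for _, c in counts.most_common(k)) / len(instances)

def templates_needed_for_recall(instances, target_recall):
    """Minimal number of templates whose instances reach the target recall."""
    covered, needed = 0, 0
    for _, c in Counter(instances).most_common():
        if covered / len(instances) >= target_recall:
            break
        covered += c
        needed += 1
    return needed

# Hypothetical normalized-template annotations for a handful of protein pairs.
instances = (['X interact with Y'] * 5 + ['X bind to Y'] * 3 +
             ['X associate with Y'] * 2 + ['X activate Y', 'X phosphorylate Y'])
print(coverage_by_top_templates(instances, 2))      # ~0.67: top 2 cover 8 of 12
print(templates_needed_for_recall(instances, 0.5))  # 2 templates reach 50% here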
<Section position="6" start_page="412" end_page="412" type="metho"> <SectionTitle> 5 Implemented Prototype </SectionTitle> <Paragraph position="0"> This section describes our initial implementation of the approach in Section 3.</Paragraph> <Section position="1" start_page="412" end_page="412" type="sub_section"> <SectionTitle> 5.1 TEASE </SectionTitle> <Paragraph position="0"> The TEASE algorithm (Szpektor et al., 2004) is an unsupervised method for acquiring entailment relations from the Web for a given input template.</Paragraph> <Paragraph position="1"> In this paper we use TEASE for entailment relation acquisition since it processes an input template in a completely unsupervised manner and because of its broad domain coverage obtained from the Web. The reported percentage of correct output templates for TEASE is 44%.</Paragraph> <Paragraph position="2"> The TEASE algorithm consists of three steps, demonstrated in Table 6. TEASE first retrieves from the Web sentences containing the input template. From these sentences it extracts variable instantiations, termed anchor-sets, which are identified as being characteristic for the input template based on statistical criteria (first column in Table 6). Characteristic anchor-sets are assumed to uniquely identify a specific event or fact. Thus, any template that appears with such an anchor-set is assumed to have an entailment relationship with the input template. Next, TEASE retrieves from the Web a corpus S of sentences that contain the characteristic anchor-sets (second column), hoping to find occurrences of these anchor-sets within templates other than the original input template.</Paragraph> <Paragraph position="3"> Finally, TEASE parses S and extracts templates that are assumed to entail or be entailed by the input template. Such templates are identified as maximal most general sub-graphs that contain the anchor sets' positions (third column in Table 6). Each learned template is ranked by its number of occurrences in S.</Paragraph> </Section> <Section position="2" start_page="412" end_page="412" type="sub_section"> <SectionTitle> 5.2 Transformation-based Graph Matcher </SectionTitle> <Paragraph position="0"> In order to identify instances of entailing templates in sentences, we developed a syntactic matcher that is based on transformation rules. The matcher processes a sentence in three steps: 1) parsing the sentence with the Minipar parser, obtaining a dependency graph; 2) matching each template against the sentence dependency graph; 3) extracting candidate term pairs that match the template variables.</Paragraph> <Paragraph position="1"> A template is considered directly matched in a sentence if it appears as a sub-graph in the sentence dependency graph, with its variables instantiated. To further address the syntactic phenomena listed in Table 1 we created a set of hand-crafted parser-dependent transformation rules, which account for the different ways in which syntactic relationships may be realized in a sentence. A transformation rule maps the left-hand side of the rule, which strictly matches a sub-graph of the given template, to the right-hand side of the rule, which strictly matches a sub-graph of the sentence graph. If a rule matches, the template sub-graph is mapped accordingly onto the sentence graph.</Paragraph> <Paragraph position="2"> For example, to match the syntactic template 'X(N) subj- activate(V) obj- Y(N)' (POS tags are in parentheses) in the sentence "Prot1 detected and activated Prot2" (see Figure 1) we should handle the coordination phenomenon.</Paragraph> <Paragraph position="3"> The matcher uses a coordination transformation rule to overcome the syntactic differences. In this example Var1 matches the verb 'activate', Word matches the verb 'detect', and the syntactic relations for Word are mapped to the ones for Var1. Thus, we can infer that the subject and object relations of 'detect' also apply to 'activate'.</Paragraph> </Section> </Section> </Paper>
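To make the matching procedure of Section 5.2 concrete, here is a minimal sketch over a toy dependency representation. It assumes hand-made (head, relation, dependent) edges and a tiny lemma table instead of real Minipar output, and the coordination rule below is an illustrative stand-in for the paper's hand-crafted transformation rules, not their actual rule encoding.

# Toy dependency graph for "Prot1 detected and activated Prot2":
# a list of (head, relation, dependent) edges over word tokens.
SENT = [("detected", "subj", "Prot1"),
        ("detected", "obj",  "Prot2"),
        ("detected", "conj", "activated")]

# Template for 'X activate Y' over lemmas, with variables X and Y.
TEMPLATE = [("activate", "subj", "X"),
            ("activate", "obj",  "Y")]

LEMMA = {"detected": "detect", "activated": "activate",
         "Prot1": "Prot1", "Prot2": "Prot2"}

def coordination_rule(edges):
    """Illustrative transformation: if verb A is conjoined with verb B,
    copy A's subj/obj edges onto B, so conjoined verbs share arguments."""
    extra = []
    for head, rel, dep in edges:
        if rel == "conj":
            extra += [(dep, r, d) for h, r, d in edges
                      if h == head and r in ("subj", "obj")]
    return edges + extra

def match(template, edges):
    """Return variable bindings if every template edge maps onto a sentence edge."""
    bindings = {}
    for t_head, t_rel, t_dep in template:
        for s_head, s_rel, s_dep in edges:
            if t_rel == s_rel and LEMMA.get(s_head) == t_head:
                if t_dep in ("X", "Y"):
                    bindings[t_dep] = s_dep
                break
        else:
            return None  # some template edge has no counterpart in the sentence
    return bindings

print(match(TEMPLATE, SENT))                      # None: direct match fails
print(match(TEMPLATE, coordination_rule(SENT)))   # {'X': 'Prot1', 'Y': 'Prot2'}

In this toy run the direct match fails because 'activated' has no subject or object edge of its own; after the coordination transformation copies the arguments of 'detected' onto 'activated', the template matches with X = Prot1 and Y = Prot2, mirroring the example in the text.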