<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1104"> <Title>A Composite Kernel to Extract Relations between Entities with both Flat and Structured Features</Title> <Section position="5" start_page="825" end_page="827" type="metho"> <SectionTitle> 3 Composite Kernel for Relation Extraction </SectionTitle> <Paragraph position="0"> In this section, we define the composite kernel and study the effective representation of a relation instance.</Paragraph> <Section position="1" start_page="825" end_page="827" type="sub_section"> <SectionTitle> 3.1 Composite Kernel </SectionTitle> <Paragraph position="0"> Our composite kernel consists of an entity kernel and a convolution parse tree kernel. To our knowledge, convolution kernels have not been explored for relation extraction.</Paragraph> <Paragraph position="1"> (1) Entity Kernel: The ACE 2003 data defines four entity features: entity headword, entity type and subtype (only for GPE), and mention type, while the ACE 2004 data makes some modifications and introduces a new feature "LDC mention type". Our statistics on the ACE data reveal that the entity features impose a strong constraint on relation types. Therefore, we design a linear kernel to explicitly capture such features:

$$K_L(R_1, R_2) = \sum_{i=1,2} K_E(R_1.E_i, R_2.E_i) \qquad (1)$$

where $R_1$ and $R_2$ stand for two relation instances, $R.E_i$ means the $i$-th entity of a relation instance, and $K_E(\cdot,\cdot)$ is a simple kernel over the features of two entities:

$$K_E(E_1, E_2) = \sum_i C(E_1.f_i, E_2.f_i) \qquad (2)$$

where $f_i$ denotes the $i$-th entity feature, and the function $C(\cdot,\cdot)$ returns 1 if the two feature values are identical and 0 otherwise. $K_E(\cdot,\cdot)$ returns the number of feature values in common of two entities.</Paragraph> <Paragraph position="2"> (2) Convolution Parse Tree Kernel: A convolution kernel aims to capture structured information in terms of substructures. Here we use the same convolution parse tree kernel as described in Collins and Duffy (2001) for syntactic parsing and Moschitti (2004) for semantic role labeling. Generally, we can represent a parse tree $T$ by a vector of integer counts of each sub-tree type (regardless of its ancestors):

$$\phi(T) = (\#subtree_1(T), \#subtree_2(T), \ldots, \#subtree_n(T))$$

where $\#subtree_i(T)$ is the occurrence number of the $i$-th sub-tree type ($subtree_i$) in $T$. Since the number of different sub-trees is exponential in the parse tree size, it is computationally infeasible to use the feature vector $\phi(T)$ directly. To solve this computational issue, Collins and Duffy (2001) proposed the following parse tree kernel to calculate the dot product between the above high-dimensional vectors implicitly:

$$K(T_1, T_2) = \langle \phi(T_1), \phi(T_2) \rangle = \sum_i \#subtree_i(T_1) \cdot \#subtree_i(T_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \Delta(n_1, n_2) \qquad (3)$$

where $N_1$ and $N_2$ are the sets of nodes in trees $T_1$ and $T_2$, $I_{subtree_i}(n)$ is a function that is 1 iff $subtree_i$ occurs with its root at node $n$ and 0 otherwise, and $\Delta(n_1, n_2) = \sum_i I_{subtree_i}(n_1) \cdot I_{subtree_i}(n_2)$ is the number of common sub-trees rooted at $n_1$ and $n_2$. $\Delta(n_1, n_2)$ can be computed by the following recursive rules: (1) if the productions at $n_1$ and $n_2$ are different, $\Delta(n_1, n_2) = 0$; (2) else if both $n_1$ and $n_2$ are pre-terminals (POS tags), $\Delta(n_1, n_2) = 1 \times \lambda$; (3) else $\Delta(n_1, n_2) = \lambda \prod_{j=1}^{\#ch(n_1)} \big(1 + \Delta(ch(n_1, j), ch(n_2, j))\big)$, where $\#ch(n)$ is the number of children of node $n$, $ch(n, j)$ is the $j$-th child of node $n$, and $\lambda$ ($0 < \lambda < 1$) is the decay factor, used in order to make the kernel value less variable with respect to the sub-tree sizes. In addition, the recursive rule (3) holds because, given two nodes with the same children, one can construct common sub-trees using these children and common sub-trees of further offspring. The parse tree kernel counts the number of common sub-trees as the syntactic similarity measure between two relation instances. The time complexity for computing this kernel is $O(|N_1| \cdot |N_2|)$.</Paragraph>
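<Paragraph position="3"> For concreteness, the following minimal Python sketch mirrors eqns (1)-(3). The Node class and the entity feature dictionaries are illustrative stand-ins introduced here, not the paper's implementation (the authors use the binary SVMLight and Moschitti's Tree Kernel Tools; see Section 4.1). A production implementation would also memoize delta over node pairs (keyed on node ids) to stay within the O(|N1|*|N2|) bound instead of recomputing it recursively.

```python
# Sketch of the entity kernel (eqns 1-2) and the convolution parse tree
# kernel (eqn 3). All class and function names are illustrative only.
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List

LAMBDA = 0.4  # decay factor lambda, 0 < lambda < 1 (0.4 per Section 4.1)

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

    def production(self):
        """The production rooted at this node, e.g. ('NP', ('DT', 'NN'))."""
        return (self.label, tuple(c.label for c in self.children))

    def is_preterminal(self):
        """POS node: a single child that is a bare word (leaf)."""
        return len(self.children) == 1 and not self.children[0].children

    def nodes(self):
        yield self
        for child in self.children:
            yield from child.nodes()

def entity_kernel(feats1: List[Dict[str, str]],
                  feats2: List[Dict[str, str]]) -> float:
    """K_L (eqns 1-2): number of feature values shared by corresponding entities."""
    return float(sum(
        1
        for e1, e2 in zip(feats1, feats2)   # the two entities, i = 1, 2
        for f, v in e1.items()
        if e2.get(f) == v                   # C(., .): 1 if identical, else 0
    ))

def delta(n1: Node, n2: Node) -> float:
    """Decayed number of common sub-trees rooted at n1 and n2."""
    if not n1.children or not n2.children:  # bare words are not sub-trees
        return 0.0
    if n1.production() != n2.production():  # rule (1)
        return 0.0
    if n1.is_preterminal():                 # rule (2): identical POS-word pair
        return LAMBDA
    result = LAMBDA                         # rule (3): combine over children
    for c1, c2 in zip(n1.children, n2.children):
        result *= 1.0 + delta(c1, c2)
    return result

def tree_kernel(t1: Node, t2: Node) -> float:
    """K(T1, T2) (eqn 3): sum of delta over all node pairs."""
    return sum(delta(n1, n2) for n1, n2 in product(t1.nodes(), t2.nodes()))
```
</Paragraph>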
<Paragraph position="4"> In this paper, two composite kernels are defined by combining the above two individual kernels in the following ways:</Paragraph> <Paragraph position="5"> 1) Linear combination:

$$K_1(R_1, R_2) = \alpha \cdot \hat{K}_L(R_1, R_2) + (1 - \alpha) \cdot \hat{K}(T_1, T_2)$$

Here, $\hat{K}(\cdot,\cdot)$ denotes a normalized kernel and $\alpha$ is the combination coefficient. Evaluation on the development set shows that this composite kernel yields the best performance when $\alpha$ is set to 0.4.</Paragraph> <Paragraph position="6"> 2) Polynomial expansion:

$$K_2(R_1, R_2) = \alpha \cdot \hat{K}_P(R_1, R_2) + (1 - \alpha) \cdot \hat{K}(T_1, T_2)$$

where $K_P(\cdot,\cdot)$ is the polynomial expansion of $K_L(\cdot,\cdot)$ with degree $d = 2$, i.e. $K_P(\cdot,\cdot) = (K_L(\cdot,\cdot) + 1)^2$, and $\alpha$ is the combination coefficient. Evaluation on the development set shows that this composite kernel yields the best performance when $\alpha$ is set to 0.23.</Paragraph> <Paragraph position="7"> The polynomial expansion aims to explore the entity bi-gram features, especially the combined features from the first and second entities. In addition, due to the different scales of the values of the two individual kernels, they are normalized before combination; this avoids the value of one kernel being overwhelmed by that of the other.</Paragraph> <Paragraph position="8"> The entity kernel formulated by eqn. (1) is a proper kernel since it simply calculates the dot product of the entity feature vectors. The tree kernel formulated by eqn. (3) is proven to be a proper kernel (Collins and Duffy, 2001). Since the set of kernel functions is closed under normalization, polynomial expansion and linear combination (Scholkopf and Smola, 2001), the two composite kernels are also proper kernels. A kernel $K(x, y)$ can be normalized by dividing it by $\sqrt{K(x, x) \cdot K(y, y)}$.</Paragraph>
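<Paragraph position="9"> As a sketch under the same illustrative assumptions as above, the two composite kernels can be written as follows, reusing entity_kernel and tree_kernel from the previous sketch; the relation-instance objects with .entities and .tree attributes are hypothetical glue code, and the coefficients are the tuned values quoted in the text.

```python
import math

ALPHA_LINEAR, ALPHA_POLY = 0.4, 0.23  # tuned on the development set

def normalized(K, x, y):
    """K_hat(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y))."""
    return K(x, y) / math.sqrt(K(x, x) * K(y, y))

def composite_linear(r1, r2):
    """K_1: linear combination of the normalized entity and tree kernels."""
    return (ALPHA_LINEAR * normalized(entity_kernel, r1.entities, r2.entities)
            + (1.0 - ALPHA_LINEAR) * normalized(tree_kernel, r1.tree, r2.tree))

def composite_poly(r1, r2):
    """K_2: degree-2 polynomial expansion K_P = (K_L + 1)^2, then combination."""
    K_P = lambda a, b: (entity_kernel(a, b) + 1.0) ** 2
    return (ALPHA_POLY * normalized(K_P, r1.entities, r2.entities)
            + (1.0 - ALPHA_POLY) * normalized(tree_kernel, r1.tree, r2.tree))
```

Normalizing first and then combining keeps both terms in [0, 1] (both base kernels are non-negative), which is what prevents the larger-scaled kernel from dominating.</Paragraph>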
</Section> <Section position="2" start_page="827" end_page="827" type="sub_section"> <SectionTitle> 3.2 Relation Instance Spaces </SectionTitle> <Paragraph position="0"> A relation instance is encapsulated by a parse tree. Thus, it is critical to understand which portion of a parse tree is important in the kernel calculation. We study five cases, as shown in Fig. 1.</Paragraph> <Paragraph position="1"> (1) Minimum Complete Tree (MCT): the complete sub-tree rooted at the nearest common ancestor of the two entities under consideration.</Paragraph> <Paragraph position="2"> (2) Path-enclosed Tree (PT): the smallest common sub-tree including the two entities. In other words, the sub-tree is enclosed by the shortest path linking the two entities in the parse tree (this path is also commonly used as the path feature in the feature-based methods).</Paragraph> <Paragraph position="3"> (3) Context-Sensitive Path Tree (CPT): the PT extended with the first word to the left of entity 1 and the first word to the right of entity 2.</Paragraph> <Paragraph position="4"> (4) Flattened Path-enclosed Tree (FPT): the PT with the single-in-and-out arcs of non-terminal nodes (except POS nodes) removed.</Paragraph> <Paragraph position="5"> (5) Flattened CPT (FCPT): the CPT with the single-in-and-out arcs of non-terminal nodes (except POS nodes) removed.</Paragraph> <Paragraph position="6"> Fig. 1 illustrates these different representations of an example relation instance, with the PT also shown separately for clarity. The only difference between MCT and PT lies in that MCT does not allow partial production rules (for example, NP->PP is a partial production rule while NP->NP+PP is an entire production rule at the top of the tree). For instance, only the right-most child in the left-most sub-tree [NP [CD 200] [JJ domestic] [E1-PER ...]] of the MCT is kept in the PT. By comparing the two, we can evaluate the effect of sub-trees with partial production rules (as in the PT) and the necessity of keeping the whole left and right context sub-trees (as in the MCT). The CPT adds limited surrounding context to the PT, making it context-sensitive; this is to evaluate whether the limited context information in CPT can boost performance. The FPT and FCPT flatten the PT and CPT, respectively, by removing the single-in-and-out non-terminal nodes.</Paragraph> <Paragraph position="7"> [Fig. 1: Different representations of a relation instance in the example sentence "... to 200 domestic partners of their own workers in New York", where the phrase type "E1-PER" denotes that the current node is the 1st entity with type "PERSON", and likewise for the others. The relation instance is excerpted from the ACE 2003 corpus, where a relation "SOCIAL.Other-Personal" exists between the entities "partners" (PER) and "workers" (PER). We use Charniak's parser (Charniak, 2001) to parse the example sentence. To save space, the FCPT is not shown here.]</Paragraph>
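<Paragraph position="8"> As one plausible reading of the PT definition, the following hypothetical sketch carves the PT out of a full parse; it assumes nodes annotated with token spans, which is our own illustrative convention rather than anything specified in the paper.

```python
# Hypothetical sketch: extract the Path-enclosed Tree (PT) from a full parse.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SpanNode:
    label: str
    span: Tuple[int, int]  # (first, last) token index covered by this node
    children: List["SpanNode"] = field(default_factory=list)

def covers(n: SpanNode, lo: int, hi: int) -> bool:
    return n.span[0] <= lo and hi <= n.span[1]

def path_enclosed_tree(root: SpanNode,
                       e1: Tuple[int, int], e2: Tuple[int, int]) -> SpanNode:
    """Smallest sub-tree covering both entity mentions, with all material
    outside the entity-to-entity token span pruned away."""
    lo = min(e1[0], e2[0])
    hi = max(e1[1], e2[1])
    node = root
    while True:  # descend to the lowest node that still covers both entities
        inner = [c for c in node.children if covers(c, lo, hi)]
        if len(inner) != 1:
            break
        node = inner[0]
    return prune(node, lo, hi)

def prune(n: SpanNode, lo: int, hi: int) -> SpanNode:
    """Drop children lying completely outside [lo, hi]; recurse on the rest."""
    kept = [prune(c, lo, hi) for c in n.children
            if not (c.span[1] < lo or hi < c.span[0])]
    return SpanNode(n.label, n.span, kept)
```

The CPT variant would simply widen [lo, hi] by one token on each side before pruning, per definition (3) above.</Paragraph>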
</Section> </Section> <Section position="6" start_page="827" end_page="829" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Section position="1" start_page="827" end_page="827" type="sub_section"> <SectionTitle> 4.1 Experimental Setting </SectionTitle> <Paragraph position="0"> Data: We use the English portion of both the ACE 2003 and 2004 corpora from LDC in our experiments. In the ACE 2003 data, the training set consists of 674 documents and 9683 relation instances, while the test set consists of 97 documents and 1386 relation instances. The ACE 2003 data defines 5 entity types, 5 major relation types and 24 relation subtypes. The ACE 2004 data contains 451 documents and 5702 relation instances. It redefines 7 entity types, 7 major relation types and 23 subtypes. Since Zhao and Grishman (2005) use 5-fold cross-validation on a subset of the 2004 data (the newswire and broadcast news domains, containing 348 documents and 4400 relation instances), for comparison we use the same setting (5-fold cross-validation on the same subset of the 2004 data, although the 5 partitions may not be the same) for the ACE 2004 data. Both corpora are parsed using Charniak's parser (Charniak, 2001). We iterate over all pairs of entity mentions occurring in the same sentence to generate potential relation instances. In this paper, we only measure the performance of relation extraction models on "true" mentions with "true" chaining of coreference (i.e., as annotated by LDC annotators).</Paragraph> <Paragraph position="1"> Implementation: We formalize relation extraction as a multi-class classification problem. SVM is selected as our classifier. We adopt the one-vs-others strategy and select the class with the largest margin as the final answer. The training parameters are chosen using cross-validation (C = 2.4 for the SVM; lambda = 0.4 for the tree kernel). In our implementation, we use the binary SVMLight (Joachims, 1998) and the Tree Kernel Tools (Moschitti, 2004). Precision (P), Recall (R) and F-measure (F) are adopted to measure the performance.</Paragraph> </Section> <Section position="2" start_page="827" end_page="829" type="sub_section"> <SectionTitle> 4.2 Experimental Results </SectionTitle> <Paragraph position="0"> In this subsection, we report experiments with different kernel setups for different purposes.</Paragraph> <Paragraph position="1"> (1) Tree Kernel only over Different Relation Instance Spaces: In order to better study the impact of the syntactic structure information in a parse tree on relation extraction, we remove the entity-related information from parse trees by replacing the entity-related phrase types ("E1-PER" and so on, as shown in Fig. 1) with "NP" (a sketch of this masking follows the discussion below). Table 1 compares the performance of the 5 tree kernel setups on the ACE 2003 data using the tree structure information only. It shows that: * Overall, the five different relation instance spaces are all somewhat effective for relation extraction. This suggests that structured syntactic information has good prediction power for relation extraction and that this information can be well captured by the tree kernel. * MCT performs much worse than the others. The reason may be that MCT includes too much left and right context, which may introduce many noisy features and cause over-fitting (high precision and very low recall, as shown in Table 1). This suggests that keeping only the complete (not partial) production rules in MCT does harm performance. * PT achieves the best performance. This means that keeping only the portion of a parse tree enclosed by the shortest path between the entities models relations better than all the other spaces. This may be because the most significant information lies within the PT, and including context information may introduce too much noise. Although context may include some useful information, it is still a problem to correctly utilize such useful information in the tree kernel for relation extraction. * CPT performs a bit worse than PT. In some cases (e.g., in the sentence "the merge of company A and company B ...", "merge" is a critical context word), the context information is helpful. However, the effective scope of context is hard to determine, given the complexity and variability of natural languages. * The two flattened trees perform worse than the original trees. This suggests that the single non-terminal nodes are useful for relation extraction.</Paragraph> <Paragraph position="2"> Evaluation on the ACE 2004 data also shows that PT achieves the best performance (72.5/56.7/63.6 in P/R/F). More evaluations with the entity type and order information incorporated into the tree nodes ("E1-PER", "E2-PER" and "E-GPE", as shown in Fig. 1) also show that PT performs best, with 76.1/62.6/68.7 in P/R/F on the 2003 data and 74.1/62.4/67.7 in P/R/F on the 2004 data.</Paragraph> <Paragraph position="3"> [Table 1: Performance of the five relation instance spaces on the ACE 2003 five major types using the parse tree structure information only (regardless of any entity-related information); data rows not recovered.]</Paragraph> <Paragraph position="4"> [Table 2: Performance of different kernel setups over the ACE major types of both the 2003 data (the numbers in parentheses) and the 2004 data (the numbers outside parentheses); data rows not recovered.]</Paragraph>
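<Paragraph position="5"> The entity-label masking referred to above is trivial but worth pinning down; a sketch in the same illustrative conventions (reusing SpanNode from the PT sketch) might be:

```python
def mask_entity_labels(node: SpanNode) -> SpanNode:
    """Replace entity-decorated phrase types ("E1-PER", "E2-PER", "E-GPE",
    ...) with plain "NP", so only syntactic structure remains visible."""
    label = "NP" if node.label.startswith(("E1-", "E2-", "E-")) else node.label
    return SpanNode(label, node.span,
                    [mask_entity_labels(c) for c in node.children])
```
</Paragraph>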
<Paragraph position="6"> (2) Composite Kernels: Table 2 compares the performance of different kernel setups on the ACE major types. It clearly shows that: * The composite kernels achieve significant performance improvement over the two individual kernels. This indicates that the flat and the structured features are complementary and that the composite kernels integrate them well: 1) the flat entity information is captured by the entity kernel; 2) the structured syntactic connection information between the two entities is captured by the tree kernel. * The composite kernel via the polynomial expansion outperforms the one via the linear combination by ~2 in F-measure. This suggests that the bi-gram entity features are very useful. * The entity features are quite useful: they alone achieve F-measures of 54.4/48.2, and they boost the performance largely, by ~7 in F-measure (70.1-63.2 / 69.1-61.9), when combined with the tree kernel. * It is interesting that the ACE 2004 data shows consistently better performance on all setups than the 2003 data, although the ACE 2003 data is two times larger than the ACE 2004 data. This may be due to two reasons: 1) the ACE 2004 data defines two new entity types and redefines the relation types and subtypes in order to reduce the inconsistency between LDC annotators; 2) more importantly, the ACE 2004 data defines 43 entity subtypes while there are only 3 subtypes in the 2003 data. The finer-grained classification in the 2004 data leads to a significant performance improvement of 6.2 (54.4-48.2) in F-measure over that on the 2003 data.</Paragraph> <Paragraph position="7"> Our composite kernel can achieve 77.3/65.6/70.9 and 76.1/68.4/72.1 in P/R/F over the ACE 2003/2004 major types, respectively.</Paragraph> <Paragraph position="8"> [Table 3: Performance comparison on the ACE 2002/2003 data over both 5 major types (the numbers outside parentheses) and 24 subtypes (the numbers in parentheses). Columns: Methods (2002/2003 data), P(%), R(%), F; our entry is "Ours: composite kernel 2"; data rows not recovered.]</Paragraph> <Paragraph position="9"> [Table 4: Performance comparison on the ACE 2004 data over both 7 major types (the numbers outside parentheses) and 23 subtypes (the numbers in parentheses). Columns: Methods (2004 data), P(%), R(%), F; our entry is "Ours: composite kernel 2"; data rows not recovered.]</Paragraph> </Section> </Section> <Section position="7" start_page="829" end_page="830" type="metho"> <SectionTitle> (3) Performance Comparison </SectionTitle> <Paragraph position="0"> Tables 3 and 4 compare our method with previous work on the ACE 2002/2003 and 2004 data, respectively. They show that our method outperforms the previous methods and significantly outperforms the previous two dependency kernels. This may be due to two reasons: 1) the dependency tree (Culotta and Sorensen, 2004) and the shortest path (Bunescu and Mooney, 2005) lack the internal hierarchical phrase structure information, so their corresponding kernels can only carry out node matching directly over the nodes with word tokens; 2) the parse tree kernel has fewer constraints, i.e., it is not restricted by the two constraints of the two dependency kernels (identical layer and ancestors for the matchable nodes, and identical length of the two shortest paths, as discussed in Section 2).</Paragraph> <Paragraph position="1"> Note that Bunescu and Mooney (2005) used the ACE 2002 corpus, including 422 documents, which is known to have more inconsistencies than the 2003 version, and Culotta and Sorensen (2004) used a generic ACE corpus including about 800 documents (no corpus version is specified). Since the testing corpora differ in size and version, strictly speaking these methods cannot be compared exactly and fairly. Therefore, Table 3 is only for reference purposes;
we just hope that we can get a few clues from this table.</Paragraph> <Paragraph position="2"> The above experiments verify the effectiveness of our composite kernels for relation extraction. They suggest that the parse tree kernel can effectively explore the syntactic features that are critical for relation extraction.</Paragraph> <Paragraph position="3"> [Table 5: Error distribution of major types on both the 2003 and 2004 data for the composite kernel by polynomial expansion.]</Paragraph> <Paragraph position="4"> (4) Error Analysis: Table 5 reports the error distribution of the polynomial composite kernel over the major types on the ACE data. It shows that 83.5% ((198+115)/(198+115+62)) / 85.8% ((416+171)/(416+171+96)) of the errors result from relation detection, and only 16.5%/14.2% of the errors result from relation characterization. This may be due to data imbalance and sparseness issues, since we find that the negative samples are 8 times more numerous than the positive samples in the training set. Nevertheless, it clearly directs our future work.</Paragraph> </Section> <Section position="8" start_page="830" end_page="831" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> In this section, we compare our method with the previous work from the feature engineering viewpoint and report some other observations and issues in our experiments.</Paragraph> <Section position="1" start_page="830" end_page="830" type="sub_section"> <SectionTitle> 5.1 Comparison with Previous Work </SectionTitle> <Paragraph position="0"> Here we explain in more detail, from the theoretical viewpoint, why our method performs better and significantly outperforms the previous two dependency tree kernels.</Paragraph> <Paragraph position="1"> (1) Compared with Feature-based Methods: The basic difference lies in the relation instance representation (parse tree vs. feature vector) and the similarity calculation mechanism (kernel function vs. dot product). The main difference is the different feature spaces. Regarding the parse tree features, our method implicitly represents a parse tree by a vector of integer counts of each sub-tree type, i.e., we consider the entire set of sub-tree types and their occurrence frequencies. In this way, the parse-tree-related features (the path features and the chunking features) used in the feature-based methods are embedded (as a subset) in our feature space. Moreover, the in-between word features and the entity-related features used in the feature-based methods are also captured by the tree kernel and the entity kernel, respectively. Therefore, our method has the potential to effectively capture not only most of the previous flat features but also useful syntactic structure features.</Paragraph> <Paragraph position="2"> (2) Compared with Previous Kernels: Since our method only counts the occurrence of each sub-tree, without considering the layer and the ancestors of the root node of the sub-tree, it is not limited by the constraints in Culotta and Sorensen (2004) (identical layer and ancestors for the matchable nodes, as discussed in Section 2). Moreover, the difference between our method and Bunescu and Mooney (2005) is that their kernel is defined on the shortest path between two entities instead of on entire sub-trees. However, the path does not retain the tree structure information. In addition, their kernel requires the two paths to have the same length; such a constraint is too strict.</Paragraph>
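<Paragraph position="3"> To make the sub-tree feature space concrete (our illustration, not an example from the paper): the fragment [NP [DT a] [NN car]] contributes six sub-tree features under the Collins-Duffy definition: [DT a], [NN car], and the four variants of NP -> DT NN in which each child is either expanded down to its word or not. Matching "a car" against "a red car" ([NP [DT a] [JJ red] [NN car]]) leaves only [DT a] and [NN car] in common, because NP -> DT NN and NP -> DT JJ NN are different productions; this exact-match behavior is precisely the sparse-phrase limitation revisited in Section 5.2 below.</Paragraph>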
</Section> <Section position="2" start_page="830" end_page="831" type="sub_section"> <SectionTitle> 5.2 Other Issues </SectionTitle> <Paragraph position="0"> (1) Speed Issue: Computing the convolution kernel is much slower than running feature-based classifiers. In this paper, the speed issue is addressed in three ways. First, the inclusion of the entity kernel makes the composite kernel converge quickly. Furthermore, we find that a small portion (the PT) of a full parse tree can effectively represent a relation instance; this significantly improves the speed. Finally, the parse tree kernel requires exact matches between sub-trees, which normally do not occur very frequently. Collins and Duffy (2001) report that, in practice, the running time of the parse tree kernel is close to linear (O(|N1|+|N2|)). As a result, using a PC with an Intel P4 3.0 GHz CPU and 2 GB RAM, our system takes only about 110 minutes and 30 minutes to train on the ACE 2003 (~77k training instances) and 2004 (~33k training instances) data, respectively.</Paragraph> <Paragraph position="1"> (2) Further Improvement: One potential problem with the parse tree kernel is that it carries out exact matches between sub-trees, so the kernel fails to handle sparse phrases (e.g., "a car" vs. "a red car") and near-synonymic grammar tags (e.g., the variations of a verb: go, went, gone). To some degree, this could lead to over-fitting and compromise the performance. However, the above issues can be handled by allowing grammar-driven partial rule matching and other approximate matching mechanisms in the parse tree kernel calculation. Finally, it is worth noting that, by introducing more individual kernels, our method can easily scale to cover more features from a multitude of sources (e.g., WordNet, gazetteers) that can be brought to bear on the task of relation extraction. In addition, we can also easily implement a feature weighting scheme by adjusting eqn. (2) and rule (2) in the calculation of $\Delta(n_1, n_2)$ (see Section 3.1).</Paragraph> </Section> </Section> </Paper>