<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1054"> <Title>Dependency Tree Kernels for Relation Extraction</Title>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Kernel Methods </SectionTitle>
<Paragraph position="0"> In traditional machine learning, we are given a set of training instances S = {x1 ... xN}, where each instance xi is represented by some d-dimensional feature vector. Much time is spent on the task of feature engineering: searching for the optimal feature set either manually, by consulting domain experts, or automatically, through feature induction and selection (Scott and Matwin, 1999).</Paragraph>
<Paragraph position="1"> For example, in entity detection the original instance representation is generally a word vector corresponding to a sentence. Feature extraction and induction may result in features such as part-of-speech, word n-grams, character n-grams, capitalization, and conjunctions of these features. In the case of more structured objects, such as parse trees, features may include some description of the object's structure, such as &quot;has an NP-VP subtree.&quot; Kernel methods can be particularly effective at reducing the feature engineering burden for structured objects. By calculating the similarity between two objects, kernel methods can employ dynamic programming solutions to efficiently enumerate over substructures that would be too costly to include explicitly as features.</Paragraph>
<Paragraph position="2"> Formally, a kernel function K is a mapping K : X × X → [0, ∞) from a pair of instances to a similarity score K(x,y) = Σi φi(x)φi(y) = φ(x) · φ(y). Here, φi(x) is some feature function over the instance x. The kernel function must be symmetric [K(x,y) = K(y,x)] and positive-semidefinite. By positive-semidefinite, we require that if x1,...,xn ∈ X, then the n × n matrix G defined by Gij = K(xi,xj) is positive semidefinite. It has been shown that any function that takes the dot product of feature vectors is a kernel function (Haussler, 1999).</Paragraph>
<Paragraph position="5"> A simple kernel function takes the dot product of the vector representations of the instances being compared. For example, in document classification, each document can be represented by a binary vector, where each element corresponds to the presence or absence of a particular word in that document.</Paragraph>
<Paragraph position="6"> Here, φi(x) = 1 if word i occurs in document x.</Paragraph>
<Paragraph position="7"> Thus, the kernel function K(x,y) returns the number of words in common between x and y. We refer to this kernel as the &quot;bag-of-words&quot; kernel, since it ignores word order.</Paragraph>
<Paragraph position="8"> When instances are more structured, as in the case of dependency trees, more complex kernels become necessary. Haussler (1999) describes convolution kernels, which find the similarity between two structures by summing the similarity of their substructures. As an example, consider a kernel over strings. To determine the similarity between two strings, string kernels (Lodhi et al., 2000) count the number of common subsequences in the two strings and weight these matches by their length.</Paragraph>
<Paragraph position="9"> Thus, φi(x) is the number of times string x contains the subsequence referenced by i. These matches can be found efficiently through a dynamic program, allowing string kernels to examine long-range features that would be computationally infeasible in a feature-based method.</Paragraph>
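As a concrete illustration of the bag-of-words kernel just described, here is a minimal Python sketch (not from the paper): each document is reduced to its set of words, and K(x,y) counts the words the two documents share, which equals the dot product of their binary word vectors. The function name and whitespace tokenization are illustrative assumptions.

# Minimal sketch (not from the paper) of the bag-of-words kernel described above:
# K(x, y) = number of words shared by documents x and y, i.e. the dot product of
# their binary word-occurrence vectors. Whitespace tokenization is an assumption.

def bag_of_words_kernel(doc_x: str, doc_y: str) -> int:
    words_x = set(doc_x.lower().split())
    words_y = set(doc_y.lower().split())
    return len(words_x & words_y)  # count of words in common

print(bag_of_words_kernel("troops advanced near Tikrit",
                          "the troops advanced quickly"))  # -> 2 ("troops", "advanced")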
<Paragraph position="10"> Given a training set S = {x1 ... xN}, kernel methods compute the Gram matrix G such that Gij = K(xi,xj). Given G, the classifier finds a hyperplane which separates instances of different classes. To classify an unseen instance x, the classifier first projects x into the feature space defined by the kernel function. Classification then consists of determining on which side of the separating hyperplane x lies.</Paragraph>
<Paragraph position="11"> A support vector machine (SVM) is a type of classifier that formulates the task of finding the separating hyperplane as the solution to a quadratic programming problem (Cristianini and Shawe-Taylor, 2000). Support vector machines attempt to find a hyperplane that not only separates the classes but also maximizes the margin between them. The hope is that this will lead to better generalization performance on unseen instances.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="3" type="metho"> <SectionTitle> 4 Augmented Dependency Trees </SectionTitle>
<Paragraph position="0"> Our task is to detect and classify relations between entities in text. We assume that entity tagging has been performed, so to generate potential relation instances we iterate over all pairs of entities occurring in the same sentence. For each entity pair, we create an augmented dependency tree (described below) representing this instance. Given a labeled training set of potential relations, we define a tree kernel over dependency trees, which we then use in an SVM to classify test instances.</Paragraph>
<Paragraph position="1"> A dependency tree is a representation that denotes grammatical relations between words in a sentence (Figure 1). A set of rules maps a parse tree to a dependency tree. For example, subjects are dependent on their verbs and adjectives are dependent on the nouns they modify. Note that for the purposes of this paper we do not consider the link labels (e.g. &quot;object&quot;, &quot;subject&quot;); instead we use only the dependency structure. To generate the parse tree of each sentence, we use MXPOST, a maximum entropy statistical parser; we then convert this parse tree to a dependency tree. Note that the left-to-right ordering of the sentence is maintained in the dependency tree only among siblings (i.e. the dependency tree does not specify an order in which to traverse the tree to recover the original sentence).</Paragraph>
<Paragraph position="5"> For each pair of entities in a sentence, we find the smallest common subtree in the dependency tree that includes both entities. We choose to use this subtree instead of the entire tree to reduce noise and emphasize the local characteristics of relations.</Paragraph>
<Paragraph position="6"> We then augment each node of the tree with a feature vector (Table 3). The relation-argument feature specifies whether an entity is the first or second argument in a relation. This is required to learn asymmetric relations (e.g. X OWNS Y).</Paragraph>
<Paragraph position="7"> Each instance is thus an augmented dependency tree T with nodes {t0 ... tn}. The features of node ti are given by φ(ti) = {v1 ... vd}. We refer to the jth child of node ti as ti[j], and we denote the set of all children of node ti as ti[c]. We reference a subset j of children of ti by ti[j] ⊆ ti[c]. Finally, we refer to the parent of node ti as ti.p.</Paragraph>
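To make the notation just introduced concrete (φ(ti), ti[c], ti.p), the following is a minimal Python sketch of one way an augmented dependency tree node could be represented; the class and field names are illustrative assumptions, not the authors' implementation.

# Illustrative sketch of an augmented dependency tree node (names are assumptions):
# each node t_i stores its feature vector phi(t_i), its ordered children t_i[c]
# (left-to-right sentence order is preserved among siblings), and its parent t_i.p.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TreeNode:
    word: str
    features: Dict[str, str] = field(default_factory=dict)    # phi(t_i), e.g. general-pos, entity-type
    children: List["TreeNode"] = field(default_factory=list)  # t_i[c]
    parent: Optional["TreeNode"] = None                       # t_i.p

    def add_child(self, child: "TreeNode") -> None:
        child.parent = self
        self.children.append(child)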
<Paragraph position="8"> From the example in Figure 1, t0[1] = t2.</Paragraph>
<SectionTitle> 5 Tree Kernels for Dependency Trees </SectionTitle>
<Paragraph position="10"> We now define a kernel function for dependency trees. The tree kernel is a function K(T1,T2) that returns a normalized, symmetric similarity score in the range (0,1) for two trees T1 and T2. We define a slightly more general version of the kernel described by Zelenko et al. (2003).</Paragraph>
<Paragraph position="11"> We first define two functions over the features of tree nodes: a matching function m(ti,tj) ∈ {0,1} and a similarity function s(ti,tj) ∈ (0, ∞]. Let the feature vector φ(ti) = {v1 ... vd} consist of two possibly overlapping subsets φm(ti) ⊆ φ(ti) and φs(ti) ⊆ φ(ti). We use φm(ti) in the matching function and φs(ti) in the similarity function. We define m(ti,tj) = 1 if φm(ti) = φm(tj), and 0 otherwise; and s(ti,tj) = Σ_{vq ∈ φs(ti)} Σ_{vr ∈ φs(tj)} C(vq,vr), where C(vq,vr) is some compatibility function between two feature values. For example, in the simplest case, where C(vq,vr) = 1 if vq = vr and 0 otherwise, s(ti,tj) returns the number of feature values in common between feature vectors φs(ti) and φs(tj).</Paragraph>
<Paragraph position="16"> We can think of the distinction between the functions m(ti,tj) and s(ti,tj) as a way to discretize the similarity between two nodes. If φm(ti) ≠ φm(tj), then we declare the two nodes completely dissimilar. However, if φm(ti) = φm(tj), then we proceed to compute the similarity s(ti,tj). Thus, restricting nodes by m(ti,tj) is a way to prune the search space of matching subtrees, as shown below.</Paragraph>
<Paragraph position="17"> For two dependency trees T1, T2, with root nodes r1 and r2, we define the tree kernel K(T1,T2) as follows: K(T1,T2) = 0 if m(r1,r2) = 0, and K(T1,T2) = s(r1,r2) + Kc(r1[c],r2[c]) otherwise, where Kc is a kernel function over children. Let a and b be sequences of indices such that a is a sequence a1 ≤ a2 ≤ ... ≤ an, and likewise for b. Let d(a) = an − a1 + 1 and l(a) be the length of a. Then we have Kc(ti[c],tj[c]) = Σ_{a,b: l(a)=l(b)} λ^{d(a)} λ^{d(b)} Σ_{s=1..l(a)} K(ti[a_s],tj[b_s]), where the sum ranges over all pairs of matching subsequences a, b (i.e. m(ti[a_s],tj[b_s]) = 1 for every s). The constant 0 < λ < 1 is a decay factor that penalizes matching subsequences that are spread out within the child sequences. See Zelenko et al. (2003) for a proof that K is a kernel function.</Paragraph>
<Paragraph position="25"> Intuitively, whenever we find a pair of matching nodes, we search for all matching subsequences of the children of each node. A matching subsequence of children is a pair of sequences a and b such that m(ai,bi) = 1 for all i. For each matching pair of nodes (ai,bi) in a matching subsequence, we accumulate the result of the similarity function s(ai,bi) and then recursively search for matching subsequences of their children ai[c], bi[c].</Paragraph>
<Paragraph position="26"> We implement two types of tree kernels. A contiguous kernel only matches children subsequences that are uninterrupted by non-matching nodes. Therefore, d(a) = l(a). A sparse tree kernel, by contrast, allows non-matching nodes within matching subsequences.</Paragraph>
<Paragraph position="27"> Figure 2 shows two relation instances, where each node contains the original text plus the features used for the matching function, φm(ti) = {general-pos, entity-type, relation-argument}. (&quot;NA&quot; denotes that the feature is not present for this node.) The contiguous kernel matches the following substructures: {t0[0],u0[0]}, {t0[2],u0[1]}, {t3[0],u2[0]}. Because the sparse kernel allows non-contiguous matching sequences, it matches an additional substructure {t0[0,*,2],u0[0,*,1]}, where (*) indicates an arbitrary number of non-matching nodes.</Paragraph>
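The following is a minimal, unoptimized Python sketch of the contiguous tree kernel as we read the definition above, reusing the illustrative TreeNode class from the earlier sketch. It enumerates matching contiguous child subsequences directly rather than using the O(mn) dynamic program the paper relies on, the feature-name lists and the decay value are assumptions, and the final normalization of the score (the paper reports values in (0,1)) is omitted.

# Unoptimized sketch of the contiguous dependency tree kernel described above.
# MATCH_FEATURES, SIM_FEATURES, and LAMBDA are illustrative assumptions.

LAMBDA = 0.5  # decay factor, 0 < lambda < 1 (value chosen only for illustration)

MATCH_FEATURES = ("general-pos", "entity-type", "relation-argument")
SIM_FEATURES = ("word", "general-pos", "entity-type")  # illustrative subset

def m(t_i: TreeNode, t_j: TreeNode) -> int:
    """Matching function: 1 iff the matching-feature values are identical."""
    return int(all(t_i.features.get(f) == t_j.features.get(f) for f in MATCH_FEATURES))

def s(t_i: TreeNode, t_j: TreeNode) -> float:
    """Similarity function with C(vq, vr) = 1 iff vq == vr: count shared feature values."""
    return float(sum(1 for f in SIM_FEATURES
                     if t_i.features.get(f) is not None
                     and t_i.features.get(f) == t_j.features.get(f)))

def K(t1: TreeNode, t2: TreeNode) -> float:
    """Tree kernel: zero unless the roots match, else root similarity plus child kernel."""
    if not m(t1, t2):
        return 0.0
    return s(t1, t2) + Kc(t1.children, t2.children)

def Kc(ci, cj) -> float:
    """Contiguous child kernel: sum over uninterrupted matching child subsequences,
    so d(a) = l(a) and the decay weight is lambda^(2 * length)."""
    total = 0.0
    for length in range(1, min(len(ci), len(cj)) + 1):
        for a in range(len(ci) - length + 1):        # start of contiguous span in ci
            for b in range(len(cj) - length + 1):    # start of contiguous span in cj
                pairs = list(zip(ci[a:a + length], cj[b:b + length]))
                if all(m(x, y) for x, y in pairs):   # every aligned pair must match
                    total += (LAMBDA ** (2 * length)) * sum(K(x, y) for x, y in pairs)
    return total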
<Paragraph position="28"> Zelenko et al. (2003) have shown the contiguous kernel to be computable in O(mn) time and the sparse kernel in O(mn³) time, where m and n are the number of children in trees T1 and T2 respectively.</Paragraph> </Section>
<Section position="7" start_page="3" end_page="3" type="metho"> <SectionTitle> 6 Experiments </SectionTitle>
<Paragraph position="0"> We extract relations from the Automatic Content Extraction (ACE) corpus provided by the National Institute of Standards and Technology (NIST). The data consists of about 800 annotated text documents gathered from various newspapers and broadcasts.</Paragraph>
<Paragraph position="3"> Five entity types have been annotated (PERSON, ORGANIZATION, GEO-POLITICAL ENTITY, LOCATION, FACILITY), along with 24 types of relations (Table 2). As the distribution of relation types in the training data shows (Figure 3), data imbalance and sparsity are potential problems.</Paragraph>
<Paragraph position="4"> In addition to the contiguous and sparse tree kernels, we also implement a bag-of-words kernel, which treats the tree as a vector of features over nodes, disregarding any structural information. We also create composite kernels by combining the sparse and contiguous kernels with the bag-of-words kernel. Joachims et al. (2001) have shown that given two kernels K1, K2, the composite kernel K12(xi,xj) = K1(xi,xj) + K2(xi,xj) is also a kernel. We find that this composite kernel improves performance when the Gram matrix G is sparse (i.e. our instances are far apart in the kernel space).</Paragraph>
<Paragraph position="6"> The features used to represent each node are shown in Table 3. After initial experimentation, the set of features we use in the matching function is φm(ti) = {general-pos, entity-type, relation-argument}, and the similarity function examines the remaining features.</Paragraph>
<Paragraph position="7"> In our experiments we tested the following five kernels: K0, the sparse tree kernel; K1, the contiguous tree kernel; K2, the bag-of-words kernel; and the composite kernels K3 = K0 + K2 and K4 = K1 + K2.</Paragraph>
<Paragraph position="9"> We also experimented with the function C(vq,vr), the compatibility function between two feature values. For example, we can increase the importance of two nodes having the same WordNet hypernym. If vq, vr are hypernym features, then we can define C(vq,vr) = α if vq = vr, and 0 otherwise. When α > 1, we increase the similarity of nodes that share a hypernym. We tested a number of weighting schemes, but did not obtain a set of weights that produced consistent, significant improvements. See Section 8 for alternate approaches to setting C.</Paragraph>
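As a small illustration of the hypernym weighting just described, here is a hedged Python sketch of one possible compatibility function C(vq,vr); the feature name and the value of α are assumptions, and (as noted above) the authors did not find a weighting that gave consistent improvements.

# Illustrative sketch of a weighted compatibility function C(v_q, v_r): identical
# values score 1, except that a shared WordNet hypernym is boosted by alpha > 1.
# The feature name "wordnet-hypernym" and ALPHA are assumptions, not paper settings.

ALPHA = 2.0  # hypothetical up-weighting for shared hypernyms

def compatibility(feature_name: str, v_q, v_r) -> float:
    if v_q is None or v_q != v_r:
        return 0.0
    if feature_name == "wordnet-hypernym":
        return ALPHA   # nodes sharing a hypernym count more
    return 1.0         # default: an exact match counts once

# A weighted similarity function s(t_i, t_j) would then sum
# compatibility(f, phi_s(t_i)[f], phi_s(t_j)[f]) over the similarity features f.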
<Paragraph position="13"> We use each kernel within an SVM. (We augment the LibSVM implementation to include our dependency tree kernel.) Note that, although training was done over all 24 relation subtypes, we evaluate only over the 5 high-level relation types. Thus, classifying a RESIDENCE relation as a LOCATED relation is deemed correct. Note also that K0 is not included in Table 4 because of its burdensome computation time. Table 4 shows that precision is adequate, but recall is low. This is a result of the aforementioned class imbalance: very few of the training examples are relations, so the classifier is less likely to identify a test instance as a relation. Because we treat every pair of mentions in a sentence as a possible relation, our training set contains fewer than 15% positive relation instances.</Paragraph>
<Paragraph position="14"> To remedy this, we retrain each SVM for a binary classification task. Here, we detect, but do not classify, relations. This allows us to combine all positive relation instances into one class, which provides more training samples for estimating the class boundary. We then threshold our output to achieve an optimal operating point. As seen in Table 5, this method of relation detection outperforms that of the multi-class classifier.</Paragraph>
<Paragraph position="15"> We then use these binary classifiers in a cascading scheme as follows: first, we use the binary SVM to detect possible relations; then, we use the SVM trained only on positive relation instances to classify each predicted relation. These results are shown in Table 6. (Table 6 caption fragment: &quot;... and C denote the kernel used for relation detection and classification, respectively.&quot;)</Paragraph>
<Paragraph position="16"> The first result of interest is that the sparse tree kernel, K0, does not perform as well as the contiguous tree kernel, K1. Suspecting that noise was introduced by the non-matching nodes allowed in the sparse tree kernel, we repeated the experiment with different values for the decay factor, λ ∈ {0.9, 0.5, 0.1}, but obtained no improvement.</Paragraph>
<Paragraph position="17"> The second result of interest is that all tree kernels outperform the bag-of-words kernel, K2, most noticeably in recall performance, implying that the structural information the tree kernel provides is extremely useful for relation detection.</Paragraph>
<Paragraph position="19"> Note that the average results reported here are representative of the performance per relation, except for the NEAR relation, which had slightly lower results overall due to its infrequency in training.</Paragraph> </Section>
<Section position="8" start_page="3" end_page="3" type="metho"> <SectionTitle> 7 Conclusions </SectionTitle>
<Paragraph position="0"> We have shown that using a dependency tree kernel for relation extraction provides a vast improvement over a bag-of-words kernel. While the dependency tree kernel appears to perform well at the task of classifying relations, recall is still relatively low. Detecting relations is a difficult task for a kernel method because the set of all non-relation instances is extremely heterogeneous and is therefore difficult to characterize with a similarity metric. An improved system might use a different method to detect candidate relations and then use this kernel method to classify the relations.</Paragraph> </Section> </Paper>