<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1015">
  <Title>Making Tree Kernels practical for Natural Language Learning</Title>
  <Section position="3" start_page="113" end_page="115" type="metho">
    <SectionTitle>
2 Fast Parse Tree Kernels
</SectionTitle>
    <Paragraph position="0"> The kernels that we consider represent trees in terms of their substructures (fragments). These latter define feature spaces which, in turn, are mapped into vector spaces, e.g. Rfracturn. The associated kernel function measures the similarity between two trees by counting the number of their common fragments. More precisely, a kernel function detects if a tree subpart (common to both trees) belongs to the feature space that we intend to generate. For such purpose, the fragment types need to be described. We consider two important characterizations: the subtrees (STs) and the sub-set trees (SSTs).</Paragraph>
    <Section position="1" start_page="113" end_page="114" type="sub_section">
      <SectionTitle>
2.1 Subtrees and Subset Trees
</SectionTitle>
      <Paragraph position="0"> In our study, we consider syntactic parse trees, consequently, each node with its children is associated with a grammar production rule, where the symbol at left-hand side corresponds to the parent node and the symbols at right-hand side are associated with its children. The terminal symbols of the grammar are always associated with the leaves of the tree. For example, Figure 1 illustrates the syntactic parse of the sentence &amp;quot;Mary brought a cat to school&amp;quot;.</Paragraph>
      <Paragraph position="2"> We define as a subtree (ST) any node of a tree along with all its descendants. For example, the line in Figure 1 circles the subtree rooted in the NP node. A subset tree (SST) is a more general structure. The difference with the subtrees is that the leaves can be associated with non-terminal symbols. The SSTs satisfy the constraint that they are generated by applying the same grammatical rule set which generated the original tree. For example, [S [N VP]] is a SST of the tree in Figure 1 which has two non-terminal symbols, N and VP, as leaves.</Paragraph>
      <Paragraph position="3">  Given a syntactic tree we can use as feature representation the set of all its STs or SSTs. For example, Figure 2 shows the parse tree of the sentence &amp;quot;Mary brought a cat&amp;quot; together with its 6 STs, whereas Figure 3 shows 10 SSTs (out of 17) of the subtree of Figure 2 rooted in VP. The  high different number of substructures gives an intuitive quantification of the different information level between the two tree-based representations.</Paragraph>
    </Section>
    <Section position="2" start_page="114" end_page="114" type="sub_section">
      <SectionTitle>
2.2 The Tree Kernel Functions
</SectionTitle>
      <Paragraph position="0"> The main idea of tree kernels is to compute the number of the common substructures between two trees T1 and T2 without explicitly considering the whole fragment space. For this purpose, we slightly modified the kernel function proposed in (Collins and Duffy, 2002) by introducing a parameter s which enables the ST or the SST evaluation. Given the set of fragments {f1,f2,..} = F, we defined the indicator function Ii(n) which is equal</Paragraph>
      <Paragraph position="2"> where NT1 and NT2 are the sets of the T1's and T2's nodes, respectively and [?](n1,n2) =summationtext |F|  i=1 Ii(n1)Ii(n2). This latter is equal to the number of common fragments rooted in the n1 and n2 nodes. We can compute [?] as follows: 1. if the productions at n1 and n2 are different then [?](n1,n2) = 0; 2. if the productions at n1 and n2 are the same, and n1 and n2 have only leaf children (i.e. they are pre-terminals symbols) then [?](n1,n2) = 1; 3. if the productions at n1 and n2 are the same, and n1 and n2 are not pre-terminals then</Paragraph>
      <Paragraph position="4"> where s [?] {0,1}, nc(n1) is the number of the children of n1 and cjn is the j-th child of the node n. Note that, since the productions are the same,</Paragraph>
      <Paragraph position="6"> [?]j [?](cjn1,cjn2) = 1, i.e. all the productions associated with the children are identical. By recursively applying this property, it follows that the subtrees in n1 and n2 are identical. Thus, Eq. 1 evaluates the subtree (ST) kernel. When s = 1, [?](n1,n2) evaluates the number of SSTs common to n1 and n2 as proved in (Collins and Duffy, 2002).</Paragraph>
      <Paragraph position="7"> Additionally, we study some variations of the above kernels which include the leaves in the fragment space. For this purpose, it is enough to add the condition: 0. if n1 and n2 are leaves and their associated symbols are equal then [?](n1,n2) = 1, to the recursive rule set for the [?] evaluation (Zhang and Lee, 2003). We will refer to such extended kernels as ST+bow and SST+bow (bag-ofwords). null Moreover, we add the decay factor l by modifying steps (2) and (3) as follows1:  2. [?](n1,n2) = l, 3. [?](n1,n2) = lproducttextnc(n1)j=1 (s + [?](cjn1,cjn2)).  The computational complexity of Eq. 1 is O(|NT1 |x |NT2|). We will refer to this basic implementation as the Quadratic Tree Kernel (QTK). However, as observed in (Collins and Duffy, 2002) this worst case is quite unlikely for the syntactic trees of natural language sentences, thus, we can design algorithms that run in linear time on average. null function Evaluate Pair Set(Tree T1, T2) returns NODE PAIR SET;</Paragraph>
      <Paragraph position="9"> n2=get next elem(L2); /*get the head element and move the pointer to the next element*/</Paragraph>
      <Paragraph position="11"> reset(L2); /*set the pointer at the first element*/</Paragraph>
      <Paragraph position="13"/>
    </Section>
    <Section position="3" start_page="114" end_page="115" type="sub_section">
      <SectionTitle>
2.3 A Fast Tree Kernel (FTK)
</SectionTitle>
      <Paragraph position="0"> To compute the kernels defined in the previous section, we sum the [?] function for each pair &lt;n1,n2&gt; [?] NT1 x NT2 (Eq. 1). When the productions associated with n1 and n2 are different, we can avoid to evaluate [?](n1,n2) since it is 0.</Paragraph>
      <Paragraph position="1">  Thus, we look for a node pair set Np ={&lt;n1,n2&gt; [?] NT1 x NT2 : p(n1) = p(n2)}, where p(n) returns the production rule associated with n.</Paragraph>
      <Paragraph position="2"> To efficiently build Np, we (i) extract the L1 and L2 lists of the production rules from T1 and T2, (ii) sort them in the alphanumeric order and (iii) scan them to find the node pairs &lt;n1,n2&gt; such that</Paragraph>
      <Paragraph position="4"> only O(|NT1|+|NT2|) time, but, if p(n1) appears r1 times in T1 and p(n2) is repeated r2 times in T2, we need to consider r1 x r2 pairs. The formal algorithm is given in Table 1.</Paragraph>
      <Paragraph position="5"> Note that:  (a) The list sorting can be done only once at the data preparation time (i.e. before training) in O(|NT1 |x log(|NT1|)).</Paragraph>
      <Paragraph position="6"> (b) The algorithm shows that the worst case oc null curs when the parse trees are both generated using only one production rule, i.e. the two internal while cycles carry out |NT1|x|NT2 |iterations. In contrast, two identical parse trees may generate a linear number of non-null pairs if there are few groups of nodes associated with the same production rule.</Paragraph>
      <Paragraph position="7"> (c) Such approach is perfectly compatible with the dynamic programming algorithm which computes [?]. In fact, the only difference with the original approach is that the matrix entries corresponding to pairs of different production rules are not considered. Since such entries contain null values they do not affect the application of the original dynamic programming. Moreover, the order of the pair evaluation can be established at run time, starting from the root nodes towards the children.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="115" end_page="116" type="metho">
    <SectionTitle>
3 A Semantic Application of Parse Tree
Kernels
</SectionTitle>
    <Paragraph position="0"> An interesting application of the SST kernel is the classification of the predicate argument structures defined in PropBank (Kingsbury and Palmer, 2002) or FrameNet (Fillmore, 1982). Figure 4 shows the parse tree of the sentence: &amp;quot;Mary brought a cat to school&amp;quot; along with the predicate argument annotation proposed in the Prop-Bank project. Only verbs are considered as predicates whereas arguments are labeled sequentially from ARG0 to ARG9.</Paragraph>
    <Paragraph position="1"> Also in FrameNet predicate/argument information is described but for this purpose richer semantic structures called Frames are used. The Frames are schematic representations of situations involving various participants, properties and roles in which a word may be typically used. Frame elements or semantic roles are arguments of predicates called target words. For example the following sentence is annotated according to the AR- null sixteen teenagers].</Paragraph>
    <Paragraph position="2"> The roles Suspect and Authorities are specific to the frame.</Paragraph>
    <Paragraph position="3"> The common approach to learn the classification of predicate arguments relates to the extraction of features from the syntactic parse tree of the target sentence. In (Gildea and Jurafsky, 2002) seven different features2, which aim to capture the relation between the predicate and its arguments, were proposed. For example, the Parse Tree Path of the pair &lt;brought, ARG1&gt; in the syntactic tree of Figure 4 is V  |VP  |NP. It encodes the dependency between the predicate and the argument as a sequence of nonterminal labels linked by direction symbols (up or down).</Paragraph>
    <Paragraph position="4"> An alternative tree kernel representation, proposed in (Moschitti, 2004), is the selection of the minimal tree subset that includes a predicate with only one of its arguments. For example, in Figure 4, the substructures inside the three frames are the semantic/syntactic structures associated with the three arguments of the verb to bring, i.e. SARG0, SARG1 and SARGM.</Paragraph>
    <Paragraph position="5"> Given a feature representation of predicate ar2Namely, they are Phrase Type, Parse Tree Path, Predicate Word, Head Word, Governing Category, Position and Voice.</Paragraph>
    <Paragraph position="6">  guments, we can build an individual ONE-vs-ALL (OVA) classifier Ci for each argument i. As a final decision of the multiclassifier, we select the argument type ARGt associated with the maximum value among the scores provided by the Ci, i.e.</Paragraph>
    <Paragraph position="8"> of argument types. We adopted the OVA approach as it is simple and effective as showed in (Pradhan et al., 2004).</Paragraph>
    <Paragraph position="9"> Note that the representation in Figure 4 is quite intuitive and, to conceive it, the designer requires much less linguistic knowledge about semantic roles than those necessary to define relevant features manually. To understand such point, we should make a step back before Gildea and Jurafsky defined the first set of features for Semantic Role Labeling (SRL). The idea that syntax may have been useful to derive semantic information was already inspired by linguists, but from a machine learning point of view, to decide which tree fragments may have been useful for semantic role labeling was not an easy task. In principle, the designer should have had to select and experiment all possible tree subparts. This is exactly what the tree kernels can automatically do: the designer just need to roughly select the interesting whole sub-tree (correlated with the linguistic phenomenon) and the tree kernel will generate all possible syntactic features from it. The task of selecting the most relevant substructures is carried out by the kernel machines themselves.</Paragraph>
  </Section>
  <Section position="5" start_page="116" end_page="118" type="metho">
    <SectionTitle>
4 The Experiments
</SectionTitle>
    <Paragraph position="0"> The aim of the experiments is twofold. On the one hand, we show that the FTK running time is linear on the average case and is much faster than QTK.</Paragraph>
    <Paragraph position="1"> This is accomplished by measuring the learning time and the average kernel computation time. On the other hand, we study the impact of the different tree based kernels on the predicate argument classification accuracy.</Paragraph>
    <Section position="1" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
4.1 Experimental Set-up
</SectionTitle>
      <Paragraph position="0"> We used two different corpora: PropBank (www.cis.upenn.edu/[?]ace) along with PennTree bank 2 (Marcus et al., 1993) and FrameNet. PropBank contains about 53,700 sentences and a fixed split between training and testing which has been used in other researches, e.g. (Gildea and Palmer, 2002; Pradhan et al., 2004). In this split, sections from 02 to 21 are used for training, section 23 for testing and sections 1 and 22 as developing set. We considered a total of 122,774 and 7,359 arguments (from ARG0 to ARG9, ARGA and ARGM) in training and testing, respectively.</Paragraph>
      <Paragraph position="1"> Their tree structures were extracted from the Penn Treebank. It should be noted that the main contribution to the global accuracy is given by ARG0, ARG1 and ARGM.</Paragraph>
      <Paragraph position="2"> From the FrameNet corpus (http://www.icsi .berkeley.edu/[?]framenet), we extracted all 24,558 sentences of the 40 Frames selected for the Automatic Labeling of Semantic Roles task of Senseval 3 (www.senseval.org). We mapped together the semantic roles having the same name and we considered only the 18 most frequent roles associated with verbal predicates, for a total of 37,948 arguments. We randomly selected 30% of sentences for testing and 70% for training. Additionally, 30% of training was used as a validationset. Note that, since the FrameNet data does not include deep syntactic tree annotation, we processed the FrameNet data with Collins' parser (Collins, 1997), consequently, the experiments on FrameNet relate to automatic syntactic parse trees.</Paragraph>
      <Paragraph position="3"> The classifier evaluations were carried out with the SVM-light-TK software available at http://ai-nlp.info.uniroma2.it/moschitti/ which encodes ST and SST kernels in the SVM-light software (Joachims, 1999). We used the default linear (Linear) and polynomial (Poly) kernels for the evaluations with the standard features defined in (Gildea and Jurafsky, 2002).</Paragraph>
      <Paragraph position="4"> We adopted the default regularization parameter (i.e., the average of 1/||vectorx||) and we tried a few cost-factor values (i.e., j [?] {1,3,7,10,30,100}) to adjust the rate between Precision and Recall on the validation-set.</Paragraph>
      <Paragraph position="5"> For the ST and SST kernels, we derived that the best l (see Section 2.2) were 1 and 0.4, respectively. The classification performance was evaluated using the F1 measure3 for the single arguments and the accuracy for the final multiclassifier. This latter choice allows us to compare our results with previous literature work, e.g. (Gildea and Jurafsky, 2002; Pradhan et al., 2004).</Paragraph>
    </Section>
    <Section position="2" start_page="116" end_page="117" type="sub_section">
      <SectionTitle>
4.2 Time Complexity Experiments
</SectionTitle>
      <Paragraph position="0"> In this section we compare our Fast Tree Kernel (FTK) approach with the Quadratic Tree Kernel (QTK) algorithm. The latter refers to the naive evaluation of Eq. 1 as presented in (Collins and Duffy, 2002).</Paragraph>
      <Paragraph position="1"> 3F1 assigns equal importance to Precision P and Recall R, i.e. f1 = 2PxRP+R .</Paragraph>
      <Paragraph position="2">  Figure 5 shows the learning time4 of the SVMs using QTK and FTK (over the SST structures) for the classification of one large argument (i.e. ARG0), according to different percentages of training data. We note that, with 70% of the training data, FTK is about 10 times faster than QTK. With all the training data FTK terminated in 6 hours whereas QTK required more than 1 week.</Paragraph>
      <Paragraph position="4"/>
      <Paragraph position="6"> training set percentages.</Paragraph>
      <Paragraph position="7"> 4We run the experiments on a Pentium 4, 2GHz, with 1 Gb ram.</Paragraph>
      <Paragraph position="8"> The above results are quite interesting because they show that (1) we can use tree kernels with SVMs on huge training sets, e.g. on 122,774 instances and (2) the time needed to converge is approximately the one required by SVMs when using polynomial kernel. This latter shows the minimal complexity needed to work in the dual space. To study the FTK running time, we extracted from PennTree bank the first 500 trees5 containing exactly n nodes, then, we evaluated all 25,000 possible tree pairs. Each point of the Figure 6 shows the average computation time on all the tree pairs of a fixed size n.</Paragraph>
      <Paragraph position="9"> In the figures, the trend lines which best interpolates the experimental values are also shown. It clearly appears that the training time is quadratic as SVMs have quadratic learning time complexity (see Figure 5) whereas the FTK running time has a linear behavior (Figure 6). The QTK algorithm shows a quadratic running time complexity, as expected. null</Paragraph>
    </Section>
    <Section position="3" start_page="117" end_page="118" type="sub_section">
      <SectionTitle>
4.3 Accuracy of the Tree Kernels
</SectionTitle>
      <Paragraph position="0"> In these experiments, we investigate which kernel is the most accurate for the predicate argument classification.</Paragraph>
      <Paragraph position="1"> First, we run ST, SST, ST+bow, SST+bow, Linear and Poly kernels over different training-set size of PropBank. Figure 7 shows the learning curves associated with the above kernels for the SVM-based multiclassifier. We note that (a) SSTs have a higher accuracy than STs, (b) bow does not improve either ST or SST kernels and (c) in the final part of the plot SST shows a higher gradient than ST, Linear and Poly. This latter produces the best accuracy 90.5% in line with the literature findings using standard features and polynomial SVMs, e.g. 87.1%6 in (Pradhan et al., 2004). Second, in tables 2 and 3, we report the results using all available training data, on PropBank and FrameNet test sets, respectively. Each row of the two tables shows the F1 measure of the individual classifiers using different kernels whereas the last column illustrates the global accuracy of the multiclassifier.</Paragraph>
      <Paragraph position="2"> 5We measured also the computation time for the incomplete trees associated with the predicate argument structures (see Section 3); we obtained the same results.</Paragraph>
      <Paragraph position="3"> 6The small difference (2.4%) is mainly due to the different treatment of ARGMs: we built a single ARGM class for all subclasses, e.g. ARGM-LOC and ARGM-TMP, whereas in (Pradhan et al., 2004), the ARGMs, were evaluated separately. null  We note that, the F1 of the single arguments across the different kernels follows the same behavior of the global multiclassifier accuracy. On FrameNet, the bow impact on the ST and SST accuracy is higher than on PropBank as it produces an improvement of about 1.5%. This suggests that (1) to detect semantic roles, lexical information is very important, (2) bow give a higher contribution as errors in POS-tagging make the word + POS fragments less reliable and (3) as the FrameNet trees are obtained with the Collins' syntactic parser, tree kernels seem robust to incorrect parse trees.</Paragraph>
      <Paragraph position="4"> Third, we point out that the polynomial kernel on flat features is more accurate than tree kernels but the design of such effective features required noticeable knowledge and effort (Gildea and Jurafsky, 2002). On the contrary, the choice of subtrees suitable to syntactically characterize a target phenomenon seems a easier task (see Section 3 for the predicate argument case). Moreover, by combining polynomial and SST kernels, we can improve the classification accuracy (Moschitti, 2004), i.e. tree kernels provide the learning algorithm with many relevant fragments which hardly can be designed by hand. In fact, as many predicate argument structures are quite large (up to 100 nodes) they contain many fragments.</Paragraph>
      <Paragraph position="5">  roles.</Paragraph>
      <Paragraph position="6"> Finally, to study the combined kernels, we applied the K1 + gK2 formula, where K1 is either the Linear or the Poly kernel and K2 is the ST  tions.</Paragraph>
      <Paragraph position="7"> or the SST kernel. Table 4 shows the results of four kernel combinations. We note that, (a) STs and SSTs improve Poly (about 0.5 and 2 percent points on PropBank and FrameNet, respectively) and (b) the linear kernel, which uses fewer features than Poly, is more enhanced by the SSTs than STs (for example on PropBank we have 89.4% and 88.6% vs. 87.6%), i.e. Linear takes advantage by the richer feature set of the SSTs. It should be noted that our results of kernel combinations on FrameNet are in contrast with (Moschitti, 2004), where no improvement was obtained. Our explanation is that, thanks to the fast evaluation of FTK, we could carry out an adequate parameterization.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="118" end_page="119" type="metho">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> Recently, several tree kernels have been designed.</Paragraph>
    <Paragraph position="1"> In the following, we highlight their differences and properties.</Paragraph>
    <Paragraph position="2"> In (Collins and Duffy, 2002), the SST tree kernel was experimented with the Voted Perceptron for the parse-tree reranking task. The combination with the original PCFG model improved the syntactic parsing. Additionally, it was alluded that the average execution time depends on the number of repeated productions.</Paragraph>
    <Paragraph position="3"> In (Vishwanathan and Smola, 2002), a linear complexity algorithm for the computation of the ST kernel is provided (in the worst case). The main idea is the use of the suffix trees to store partial matches for the evaluation of the string kernel (Lodhi et al., 2000). This can be used to compute the ST fragments once the tree is converted into a string. To our knowledge, ours is the first application of the ST kernel for a natural language task. In (Kazama and Torisawa, 2005), an interesting algorithm that speeds up the average running time is presented. Such algorithm looks for node pairs that have in common a large number of trees (malicious nodes) and applies a transformation to the trees rooted in such nodes to make faster the kernel computation. The results show an increase of the speed similar to the one produced by our method.</Paragraph>
    <Paragraph position="4"> In (Zelenko et al., 2003), two kernels over syntactic shallow parser structures were devised for the extraction of linguistic relations, e.g. personaffiliation. To measure the similarity between two  nodes, the contiguous string kernel and the sparse string kernel (Lodhi et al., 2000) were used. In (Culotta and Sorensen, 2004) such kernels were slightly generalized by providing a matching function for the node pairs. The time complexity for their computation limited the experiments on data set of just 200 news items. Moreover, we note that the above tree kernels are not convolution kernels as those proposed in this article.</Paragraph>
    <Paragraph position="5"> In (Shen et al., 2003), a tree-kernel based on Lexicalized Tree Adjoining Grammar (LTAG) for the parse-reranking task was proposed. Since QTK was used for the kernel computation, the high learning complexity forced the authors to train different SVMs on different slices of training data. Our FTK, adapted for the LTAG tree kernel, would have allowed SVMs to be trained on the whole data.</Paragraph>
    <Paragraph position="6"> In (Cumby and Roth, 2003), a feature description language was used to extract structural features from the syntactic shallow parse trees associated with named entities. The experiments on the named entity categorization showed that when the description language selects an adequate set of tree fragments the Voted Perceptron algorithm increases its classification accuracy. The explanation was that the complete tree fragment set contains many irrelevant features and may cause overfitting. null</Paragraph>
  </Section>
class="xml-element"></Paper>