<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2607"> <Title>Tree Kernel Engineering in Semantic Role Labeling Systems</Title> <Section position="3" start_page="49" end_page="50" type="metho"> <SectionTitle> 2 Preliminary Concepts </SectionTitle> <Paragraph position="0"> In this section we briefly define the SRL model that we intend to design and the kernel function that we use to evaluate the similarity between subtrees.</Paragraph> <Section position="1" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 2.1 Basic SRL approach </SectionTitle> <Paragraph position="0"> The SRL approach that we adopt is based on the deep syntactic parse (Charniak, 2000) of the sentence that we intend to annotate semantically. The standard algorithm is to classify the tree node pair <p,a>, where p and a are the nodes that exactly cover the target predicate and a potential argument, respectively. If <p,a> is labeled with an argument, then the terminal nodes dominated by a are considered the words constituting that argument. The number of pairs per sentence can be in the hundreds; thus, if we consider training corpora of thousands of sentences, we have to deal with millions of training instances.</Paragraph> <Paragraph position="1"> The usual solution to limit such complexity is to divide the labeling task into two subtasks: * Boundary detection, in which a single classifier is trained on many instances to detect whether a node is an argument or not, i.e. whether the sequence of words dominated by the target node constitutes a correct boundary.</Paragraph> <Paragraph position="2"> * Argument classification: only the nodes corresponding to correct boundaries are considered. These can be used to train a multiclassifier that, for such nodes, only decides the type of the argument. For example, we can train n classifiers in One-vs-All style.
At classification time, for each argument node, we can select the argument type associated with the maximum among the n scores provided by the single classifiers.</Paragraph> <Paragraph position="3"> We adopt this solution as it enables us to use only one computationally expensive classifier, i.e. the boundary detection one. This, as well as the argument classifiers, requires a feature representation of the predicate-argument pair. Such features are mainly extracted from the parse trees of the target sentence, e.g. Phrase Type, Predicate Word, Head Word, Governing Category, Position and Voice, proposed in (Gildea and Jurafsky, 2002).</Paragraph> <Paragraph position="4"> As most of the features proposed in the literature are subsumed by tree fragments, tree-kernel functions are a natural way to produce them automatically.</Paragraph> </Section> <Section position="2" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 2.2 Tree kernel functions </SectionTitle> <Paragraph position="0"> Tree-kernel functions simply evaluate the number of substructures shared between two trees T1 and T2. Such functions can be seen as a scalar product in the huge vector space constituted by all possible substructures of the training set. Thus, kernel functions implicitly define a large feature space.</Paragraph> <Paragraph position="1"> Formally, given a tree fragment space {f1, f2, ...} = F, we can define an indicator function Ii(n), which is equal to 1 if the target fi is rooted at node n and equal to 0 otherwise. The tree kernel is then defined as K(T1,T2) = Σ_{n1 ∈ NT1} Σ_{n2 ∈ NT2} Δ(n1,n2),</Paragraph> <Paragraph position="3"> where NT1 and NT2 are the sets of the T1's and T2's nodes, respectively, and</Paragraph> <Paragraph position="5"> Δ(n1,n2) = Σ_i Ii(n1) Ii(n2) is equal to the number of common fragments rooted at nodes n1 and n2 and, according to (Collins and Duffy, 2002), it can be computed as follows: 1. if the productions at n1 and n2 are different then Δ(n1,n2) = 0; 2. if the productions at n1 and n2 are the same, and n1 and n2 have only leaf children (i.e.
they are pre-terminal symbols) then Δ(n1,n2) = λ; 3. if the productions at n1 and n2 are the same, and n1 and n2 are not pre-terminal then</Paragraph> <Paragraph position="7"> Δ(n1,n2) = λ ∏_{j=1}^{nc(n1)} (1 + Δ(c_j^{n1}, c_j^{n2})), where λ is the decay factor that scales down the impact of large structures, nc(n1) is the number of children of n1 and c_j^n is the j-th child of node n. Note that, as the productions are the same, nc(n1) = nc(n2). Additionally, to map similarity scores into the [0,1] range, we applied a normalization in the kernel space, i.e. K'(T1,T2) = K(T1,T2)/√(K(T1,T1) · K(T2,T2)).</Paragraph> <Paragraph position="9"> Once a kernel function is defined, we need to characterize the predicate-argument pair with a subtree. This allows kernel machines to generate a large number of syntactic features related to such a pair. The approach proposed in (Moschitti, 2004) selects the minimal subtree that includes a predicate with its argument. We follow this approach, studying and proposing novel solutions.</Paragraph> </Section> </Section> <Section position="4" start_page="50" end_page="52" type="metho"> <SectionTitle> 3 Novel Kernels for SRL </SectionTitle> <Paragraph position="0"> The basic structure used to characterize the predicate-argument relation is the smallest subtree that includes a predicate with one of its arguments. For example, in Figure 1, the dashed line encloses a predicate argument feature (PAF) over the parse tree of the sentence: &quot;Paul delivers a talk in formal style&quot;.
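As an illustration of the recursion of Section 2.2 above, the Δ computation and the kernel normalization can be sketched in Python. This is a minimal sketch under a simplified tree representation of our own, not the implementation used in the experiments:

```python
import math

class Node:
    # Minimal parse-tree node (an illustrative assumption, not the
    # authors' data structure): a label and a list of children.
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    def production(self):
        # A node's production: its label together with its children's labels.
        return (self.label, tuple(c.label for c in self.children))

    def nodes(self):
        yield self
        for c in self.children:
            yield from c.nodes()

def delta(n1, n2, lam=0.4):
    """Number of common fragments rooted at n1 and n2 (Collins and Duffy, 2002)."""
    if not n1.children or not n2.children:
        return 0.0          # leaves (words) root no fragments
    if n1.production() != n2.production():
        return 0.0          # case 1: different productions
    if all(not c.children for c in n1.children):
        return lam          # case 2: pre-terminal nodes
    prod = lam              # case 3: recurse over the aligned children
    for c1, c2 in zip(n1.children, n2.children):
        prod *= 1.0 + delta(c1, c2, lam)
    return prod

def tree_kernel(t1, t2, lam=0.4):
    # K(T1,T2): sum of delta over all node pairs of the two trees.
    return sum(delta(n1, n2, lam) for n1 in t1.nodes() for n2 in t2.nodes())

def normalized_kernel(t1, t2, lam=0.4):
    # K'(T1,T2) = K(T1,T2) / sqrt(K(T1,T1) * K(T2,T2)), in the [0,1] range.
    return tree_kernel(t1, t2, lam) / math.sqrt(
        tree_kernel(t1, t1, lam) * tree_kernel(t2, t2, lam))
```

On two identical trees the normalized kernel evaluates to 1; trees sharing no production score 0.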
This PAF is a subtree that characterizes the predicate to deliver with its argument a talk.</Paragraph> <Paragraph position="1"> In this section, we improve PAFs, propose different kernels for internal and pre-terminal nodes, and introduce new kernels based on complete predicate argument structures.</Paragraph> <Section position="1" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 3.1 Improving PAF </SectionTitle> <Paragraph position="0"> PAFs have been shown to be very effective for argument classification but not for boundary detection.</Paragraph> <Paragraph position="1"> The reason is that two nodes that encode correct and incorrect boundaries may generate very similar PAFs. For example, Figure 3.A shows two PAFs corresponding to a correct (PAF+) and an incorrect (PAF-) choice of the boundary for A1: PAF+ from the NP node vs. PAF- from the N node. The number of their common substructures is high, i.e.</Paragraph> <Paragraph position="2"> the four subtrees shown in Frame C. This prevents the algorithm from making different decisions for such cases.</Paragraph> <Paragraph position="3"> To solve this problem, we specify the node that exactly covers the argument (also called the argument node) by simply marking it with the label B, denoting the boundary property. Figure 3.B shows the two new marked PAFs (MPAFs). The features generated from the two subtrees are now very different, so that there is only one substructure in common (see Frame D). Note that each markup strategy affects the output of a kernel function in terms of the number of structures common to two trees. The same output can be obtained using unmarked trees and consistently redefining the kernel function, e.g. the algorithm described in Section 2.2.</Paragraph> <Paragraph position="4"> An alternative way to partially solve the structure overlapping problem is to use two different classifiers, one for the internal nodes and one for the pre-terminal nodes, and to combine their decisions.
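As a side note, the marking of Section 3.1 is a purely structural relabeling. A minimal sketch, assuming a nested-list tree encoding of our own (not the authors' tool chain), covering both the single marking above and the complete marking later used for the CMSTs of Section 3.2:

```python
def mark_node(tree, marker="-B"):
    # MPAF-style: relabel only the argument node, e.g. NP becomes NP-B.
    return [tree[0] + marker] + list(tree[1:])

def mark_subtree(tree, marker="-B"):
    # CMST-style (Section 3.2): relabel the argument node and all of its
    # descendant non-terminals; bare words (strings) stay unmarked.
    if isinstance(tree, str):
        return tree
    return [tree[0] + marker] + [mark_subtree(t, marker) for t in tree[1:]]
```

For example, mark_node applied to ["NP", ["DT", "the"], ["NN", "talk"]] relabels only the NP root, so fragments rooted at it no longer match their unmarked counterparts in an incorrect-boundary PAF.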
In this way, the negative example of Figure 3 would not be used to train the same classifier that uses PAF+. Of course, similar structures can both be rooted at internal nodes, and therefore can belong to the training data of the same classifier. However, the use of different classifiers is also motivated by the fact that many argument types are found mostly at pre-terminal nodes, e.g. modifier or negation arguments, and do not require training data extracted from internal nodes. Consequently, it is more convenient (at least from a computational point of view) to use two different boundary classifiers, hereinafter referred to as the combined classifier.</Paragraph> </Section> <Section position="2" start_page="51" end_page="52" type="sub_section"> <SectionTitle> 3.2 Kernels on complete predicate argument structures </SectionTitle> <Paragraph position="0"> The type of a target argument strongly depends on the type and number of the predicate's arguments (Punyakanok et al., 2005; Toutanova et al., 2005).</Paragraph> <Paragraph position="1"> Consequently, to correctly label an argument, we should extract features from the complete predicate argument structure it belongs to. In contrast, PAFs completely neglect the information (i.e. the tree portions) related to non-target arguments.</Paragraph> <Paragraph position="2"> One way to use this further information with tree kernels is to use the minimum subtree that spans all the predicate's arguments. The whole parse tree in Figure 1 is an example of such a Minimum Spanning Tree (MST), as it includes all and only the argument structures of the predicate &quot;to deliver&quot;. However, MSTs pose some problems: * We cannot use them for the boundary detection task, since we do not know the predicate's argument structure yet. However, we can derive an approximation of the MST from the nodes selected by a boundary classifier, i.e. the nodes that correspond to potential arguments.
Such approximated MSTs can easily be used in the argument type classification phase. They can also be used to re-rank the most probable m sequences of arguments for both labeling phases.</Paragraph> <Paragraph position="3"> * Obviously, an MST is the same for all the arguments it includes, thus we need a way to differentiate it for each target argument.</Paragraph> <Paragraph position="4"> Again, we can mark the node that exactly covers the target argument, as shown in the previous section. We refer to this subtree as a marked MST (MMST). However, for large arguments (i.e. those spread over a large part of the sentence tree), the likelihood of their substructures being part of other arguments is quite high.</Paragraph> <Paragraph position="5"> To address this latter problem, we can mark all nodes that descend from the target argument node. Figure 2 shows an MST in which the subtree associated with the target argument (AM) has all its nodes marked. We refer to this structure as a completely marked MST (CMST). CMSTs may be seen as PAFs enriched with new information coming from the other arguments (i.e. the non-marked subtrees). Note that if we consider only the PAF subtree of a CMST, we obtain a differently marked subtree, which we refer to as a CPAF. In the next section we study the impact of the proposed kernels on boundary detection and argument classification performance.</Paragraph> </Section> </Section> <Section position="5" start_page="52" end_page="54" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> In these experiments we evaluate the impact of our proposed kernels in terms of accuracy and efficiency. The accuracy improvement confirms that the node marking approach enables the automatic engineering of effective SRL features.
The efficiency improvement depends on (a) the smaller amount of training data used when applying two distinct classifiers for internal and pre-terminal nodes, and (b) a more adequate feature space, which allows SVMs to converge faster to a model containing a smaller number of support vectors, i.e. faster training and classification.</Paragraph> <Section position="1" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 4.1 Experimental set up </SectionTitle> <Paragraph position="0"> The empirical evaluations were carried out within the setting defined in the CoNLL-2005 Shared Task (Carreras and Màrquez, 2005). We used as a target dataset the PropBank corpus available at www.cis.upenn.edu/~ace, along with the Penn TreeBank 2 for the gold trees (www.cis.upenn.edu/~treebank) (Marcus et al., 1993), which includes about 53,700 sentences.</Paragraph> <Paragraph position="1"> Since the aim of this study was to design a real SRL system, we adopted the Charniak parse trees from the CoNLL 2005 Shared Task data (available at www.lsi.upc.edu/~srlconll/).</Paragraph> <Paragraph position="2"> We used Sections 02, 03 and 24 from the Penn TreeBank in most of the experiments. Their characteristics are shown in Table 1. Pos and Neg indicate the number of nodes corresponding or not to a correct argument boundary. Rows 3 and 4 report these numbers for the internal and pre-terminal nodes separately. We note that the latter are much fewer than the former; this results in a very fast pre-terminal classifier.</Paragraph> <Paragraph position="3"> As the automatic parse trees contain errors, some arguments cannot be associated with any covering node. This prevents us from extracting a tree representation for them. Consequently, we do not consider them in our evaluation. In Sections 02, 03 and 24 there are 454, 347 and 731 such cases, respectively.
The experiments were carried out with the SVM-light-TK software, available at http://ai-nlp.info.uniroma2.it/moschitti/, which encodes fast tree kernel evaluation (Moschitti, 2006) in the SVM-light software (Joachims, 1999). We used a regularization parameter (option -c) equal to 1 and λ = 0.4 (see (Moschitti, 2004)).</Paragraph> </Section> <Section position="2" start_page="52" end_page="53" type="sub_section"> <SectionTitle> 4.2 Boundary Detection Results </SectionTitle> <Paragraph position="0"> In these experiments, we used Section 02 for training and Section 24 for testing. The results using the PAF- and MPAF-based kernels are reported in Table 2, in rows 2 and 3, respectively. Columns 3 and 4 show the CPU testing time (in seconds) and the F1 of the monolithic boundary classifier.</Paragraph> <Paragraph position="1"> The next 3 columns show the CPU time for the internal (Int) and pre-terminal (Pre) node classifiers, as well as their total (All). The F1 measures are reported in the 3 rightmost columns. In particular, the third column refers to the F1 of the combined classifier. This has been computed by summing the correct, incorrect and not retrieved examples of the two distinct classifiers.</Paragraph> <Paragraph position="2"> We note that: first, the monolithic classifier applied to MPAF improves on PAF both in efficiency, i.e.</Paragraph> <Paragraph position="3"> about 3,131 seconds vs. 5,179, and in F1, i.e. 82.07 vs. 75.24. This suggests that marking the argument node simplifies the generalization process.</Paragraph> <Paragraph position="4"> Second, by dividing boundary classification into two tasks, internal and pre-terminal nodes, we further improve the classification time for both the PAF and MPAF kernels, i.e. 5,179 vs. 1,851 (PAF) and 3,131 vs. 1,471 (MPAF).
The separated classifiers are much faster, especially the pre-terminal one (about 61 seconds to classify 81,075 nodes).</Paragraph> <Paragraph position="5"> Third, the combined classifier approach seems quite feasible, as its F1 is almost equal to that of the monolithic one (81.96 vs. 82.07) in the case of MPAF, and even superior when using PAF (79.89 vs. 75.34).</Paragraph> <Paragraph position="6"> This result confirms the observation given in Section 3.1 about the importance of reducing the number of substructures common to PAFs associated with correct and incorrect boundaries.</Paragraph> <Paragraph position="7"> Finally, we trained the combined boundary classifiers with sets of increasing size to derive the learning curves of the PAF and MPAF models.</Paragraph> <Paragraph position="8"> To obtain more significant results, we enlarged the training set by also using Sections 03 to 07.</Paragraph> <Paragraph position="9"> Figure 4 shows that the MPAF approach is consistently above the PAF one. Consider also that the marking strategy has a lesser impact on the combined classifier.</Paragraph> </Section> <Section position="3" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 4.3 Argument Classification Results </SectionTitle> <Paragraph position="0"> In these experiments we tested different kernels on the argument classification task. As some arguments have a very small number of training instances in a single section, we also used Section 03 for training and we continued to test only on Section 24.</Paragraph> <Paragraph position="1"> The results of the multiclassifiers on 59 argument types (e.g. constituted by 59 binary classifiers in the monolithic approach) are reported when using the PAF, MPAF and CPAF, whereas rows 6 to 8 show the accuracy for the complete argument structure approaches, i.e.
MST, MMST and CMST.</Paragraph> <Paragraph position="2"> More in detail, Column 2 shows the accuracy of the monolithic multi-argument classifiers, whereas Columns 3, 4 and 5 report the accuracy of the internal, pre-terminal and combined multi-argument classifiers, respectively.</Paragraph> <Paragraph position="3"> We note that: First, the two-classifier approach does not improve on the accuracy of the monolithic approach. Indeed, the subtrees describing different argument types are quite different, and this property holds also for the pre-terminal nodes. However, we still measured a remarkable improvement in efficiency. Second, MPAF is the best kernel. This confirms the outcome of the boundary detection experiments. The fact that it is more accurate than CPAF reveals that we need to distinguish the argument node from the other nodes.</Paragraph> <Paragraph position="4"> The 59 argument types comprise 7 for the core arguments (A0...AA), 13 for the adjunct arguments (AM-*), 19 for the argument references (R-*) and 20 for the continuations (C-*).</Paragraph> <Paragraph position="5"> To explain this, suppose that two argument nodes, NP1 and NP2, dominate the following structures: [NP1 [NP [DT NN]][PP]] and [NP2 [DT NN]]. If we mark only the argument node, we obtain [NP-B [NP [DT NN]][PP]] and [NP-B [DT NN]], which have no structure in common. In contrast, if we mark them completely, i.e. [NP-B [NP-B [DT-B NN-B]][PP-B]] and [NP-B [DT-B NN-B]], they will share the subtree [NP-B [DT-B NN-B]]. Thus, although it may seem counterintuitive, by marking only one node, we obtain more specific substructures. Of course, if we use different labels for the argument nodes and their descendants, we obtain the same specialization effect.</Paragraph> <Paragraph position="6"> Finally, if we do not mark the target argument in the MSTs, we obtain a very low result (i.e. 40.10%), as expected. When we mark the covering node or the complete argument subtree, we obtain an acceptable accuracy.
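The specialization effect of single-node marking can be verified mechanically. The following sketch (the nested-list tree encoding and helper names are our own illustrative assumptions) computes the productions shared by the two example structures under each marking strategy:

```python
def mark(tree, everywhere=False, marker="-B"):
    # Relabel the argument node; with everywhere=True also relabel all
    # of its descendants (the two marking strategies compared above).
    if isinstance(tree, str):
        return tree
    children = [mark(t, True, marker) for t in tree[1:]] if everywhere else list(tree[1:])
    return [tree[0] + marker] + children

def productions(tree):
    # Collect the (parent, children) productions of a nested-list tree;
    # single-node subtrees contribute no production.
    if isinstance(tree, str) or len(tree) == 1:
        return set()
    out = {(tree[0], tuple(t if isinstance(t, str) else t[0] for t in tree[1:]))}
    for t in tree[1:]:
        out |= productions(t)
    return out

# The two structures from the text, with DT, NN, PP as single-node subtrees.
np1 = ["NP", ["NP", ["DT"], ["NN"]], ["PP"]]
np2 = ["NP", ["DT"], ["NN"]]

# Marking only the argument node: no production survives in both trees.
shared_single = productions(mark(np1)).intersection(productions(mark(np2)))

# Complete marking: the production NP-B -> DT-B NN-B is shared.
shared_full = productions(mark(np1, True)).intersection(productions(mark(np2, True)))
```

Here shared_single comes out empty while shared_full still contains the production NP-B -> DT-B NN-B, mirroring the shared subtree [NP-B [DT-B NN-B]] discussed above.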
Unfortunately, such accuracy is lower than that produced by PAFs, e.g. 73.17% vs. 77.07%, so it may seem that the additional information provided by the whole argument structure is not effective. A more careful analysis can be carried out by considering a CMST as composed of a PAF and the rest of the argument structure. We observe that some pieces of information provided by a PAF are not derivable from a CMST (or an MMST). For example, Figure 1 shows that the PAF contains the subtree [VP [V NP]], while the associated CMST (see Figure 2) contains [VP [V NP PP]]. The latter structure is larger and more sparse; consequently, the learning machine applied to CMSTs (or MMSTs) performs a more difficult generalization task. This problem is emphasized by our use of the adjuncts in the design of MSTs. As adjuncts tend to be the same for many predicates, they do not provide very discriminative information.</Paragraph> </Section> </Section> </Paper>