<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2908"> <Title>Semantic Role Recognition using Kernels on Weighted Marked Ordered Labeled Trees</Title> <Section position="4" start_page="53" end_page="54" type="metho"> <SectionTitle> 2 Semantic Role Labeling </SectionTitle> <Paragraph position="0"> Semantic role labeling (SRL) recognizes the arguments of a given predicate and assigns the correct role to each argument. For example, the sentence "I saw a cat in the park" will be labeled as follows with respect to the predicate "see".</Paragraph> <Paragraph position="1"> [A0 I] [V saw] [A1 a cat] [AM-LOC in the park] In this example, A0, A1, and AM-LOC are the roles assigned to the arguments. In the CoNLL 2005 dataset, there are numbered arguments (AX), whose semantics are predicate-dependent, adjuncts (AM-X), and references (R-X) for relative clauses.</Paragraph> <Paragraph position="2"> Many previous studies employed two-step SRL methods, where (1) we first recognize the arguments, and then (2) classify each argument into its correct role. We also assume this two-step processing and focus on argument recognition.</Paragraph> <Paragraph position="3"> Given a parse tree, argument recognition can be cast as the classification of tree nodes into two classes, ARG and NO-ARG. We then consider the words (a phrase) that are the descendants of an ARG node to be an argument. Since arguments are defined for a given predicate, this classification is the recognition of a relation between the predicate and tree nodes. Thus, we want to build a binary classifier that returns +1 for correct relations and -1 for incorrect relations.
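To make this formulation concrete, the convention just described (an argument is the word sequence under a node classified as ARG) can be sketched as follows. This is a hypothetical illustration, not the paper's code: the node class, labels, and the is_arg classifier are stand-ins.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PNode:
    label: str                        # phrase or POS label, e.g. "NP"
    word: Optional[str] = None        # set only for leaf (word) nodes
    children: List["PNode"] = field(default_factory=list)

def words(node):
    # the words covered by a node are the words of its leaf descendants
    if node.word is not None:
        return [node.word]
    out = []
    for c in node.children:
        out.extend(words(c))
    return out

def arguments(root, is_arg):
    # every node classified as ARG yields one argument phrase
    found = []
    stack = [root]
    while stack:
        n = stack.pop()
        if is_arg(n):
            found.append(" ".join(words(n)))
        stack.extend(reversed(n.children))
    return found

# toy parse of "I saw a cat"; pretending the classifier marks both NPs as ARG
tree = PNode("S", children=[
    PNode("NP", children=[PNode("PRP", word="I")]),
    PNode("VP", children=[PNode("VBD", word="saw"),
                          PNode("NP", children=[PNode("DT", word="a"),
                                                PNode("NN", word="cat")])])])
```

In the real task, `is_arg` is the learned predicate-relative classifier rather than a label test.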
For the above example, the classifier will output +1 for the relations indicated by (a), (b), and (c) in Figure 1 and -1 for the relations between the predicate node and the other nodes.</Paragraph> <Paragraph position="4"> Since the task is the classification of trees with node relations, tree kernels for ordinary ordered labeled trees, such as those proposed by Collins and Duffy (2001) and Kashima and Koyanagi (2002), are not useful. Kazama and Torisawa (2005) proposed representing a node relation in a tree as a marked ordered labeled tree and presented a kernel for it (the MOLT kernel). We adopt the MOLT kernel and extend it for accurate argument recognition.</Paragraph> </Section> <Section position="5" start_page="54" end_page="56" type="metho"> <SectionTitle> 3 Kernels for Argument Recognition </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="54" end_page="54" type="sub_section"> <SectionTitle> 3.1 Kernel-based classification </SectionTitle> <Paragraph position="0"> Kernel-based methods, such as support vector machines (SVMs) (Vapnik, 1995), consider a mapping Φ(x) that maps the object x into a (usually high-dimensional) feature space and learn a classifier in this space. A kernel function K(xi, xj) calculates the inner product ⟨Φ(xi), Φ(xj)⟩ in the feature space without explicitly computing Φ(x), which is sometimes intractable. Any classifier that is represented using only the inner products between vectors in the feature space can then be rewritten using the kernel function. For example, an SVM classifier has the form:</Paragraph> <Paragraph position="2"> f(x) = sgn( Σi ai yi K(xi, x) + b ), where ai and b are the parameters learned in training and yi are the labels of the training examples. With kernel-based methods, we can construct a powerful classifier in a high-dimensional feature space.
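As a sketch (illustrative, not the paper's implementation), the decision function above can be written generically so that the kernel, and hence the objects x, are pluggable:

```python
def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    # f(x) = sign( sum_i a_i * y_i * K(x_i, x) + b )
    score = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if score > 0 else -1

# with a dot-product kernel on scalars this is just a linear SVM;
# swapping in a tree kernel lets x and the support vectors be parse trees
dot = lambda u, v: u * v
```

Only the kernel changes between a linear classifier and a tree-structured one; the surrounding decision rule is identical.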
In addition, objects x need not be vectors as long as a kernel function is defined (e.g., x can be strings, trees, or graphs).</Paragraph> </Section> <Section position="2" start_page="54" end_page="55" type="sub_section"> <SectionTitle> 3.2 MOLT kernel </SectionTitle> <Paragraph position="0"> A marked ordered labeled tree (Kazama and Torisawa, 2005) is an ordered labeled tree in which each node can have a mark in addition to a label. We can encode a k-node relation by using k distinct marks.</Paragraph> <Paragraph position="1"> In this study, we determine an argument node without considering the other arguments of the same predicate, i.e., we represent an argument relation as a two-node relation using two marks. For example, the relation (a) in Figure 1 can be represented as the marked ordered labeled tree (a').1</Paragraph> <Paragraph position="3"> K(T1, T2) = Σi W(Si) · #Si(T1) · #Si(T2), where Si is a possible subtree and #Si(Tj) is the number of times Si is included in Tj. The mapping corresponding to this kernel is Φ(T) = (√W(S1) · #S1(T), ..., √W(SE) · #SE(T)), which maps the tree into the feature space of all possible subtrees.</Paragraph> <Paragraph position="4"> Tree inclusion can be defined in several ways. For example, Kashima and Koyanagi (2002) presented the following type of inclusion.</Paragraph> <Paragraph position="5"> DEFINITION 1 S is included in T iff there exists a one-to-one function ψ from the nodes of S to the nodes of T such that (i) pa(ψ(ni)) = ψ(pa(ni)), (ii) ψ(ni) ⪰ ψ(nj) iff ni ⪰ nj, and (iii) l(ψ(ni)) = l(ni) (and m(ψ(ni)) = m(ni) in the MOLT kernel).</Paragraph> <Paragraph position="6"> See Table 1 for the meaning of each function. This definition means that any subtree preserving the parent-child relation, the sibling relation, and the labels and marks is allowed.
In this paper, we employ this definition, since Kazama and Torisawa (2005) reported that the MOLT kernel with this definition achieves a higher accuracy than one with the definition presented by Collins and Duffy (2001).</Paragraph> <Paragraph position="7"> W(Si) is the weight of subtree Si. The weighting in Kazama and Torisawa (2005) is written as follows (this notation is slightly different from that of Kazama and Torisawa (2005)).</Paragraph> <Paragraph position="9"> W(Si) = λ^|Si| if marked(Si) = true, and W(Si) = 0 otherwise, (1) where marked(Si) returns true iff marked(ni) = true for at least one node in tree Si. By this weighting, only the subtrees with at least one mark are considered. The idea behind this is that subtrees having no marks are not useful for relation recognition or labeling. λ (0 ≤ λ ≤ 1) is a factor that prevents the kernel values from becoming too large, and has been used in previous studies (Collins and Duffy, 2001; Kashima and Koyanagi, 2002).</Paragraph> <Paragraph position="10"> Table 2 shows an example of subtree inclusion and the weights given to each included subtree. Note that subtrees are treated differently when the markings are different, even if the labels are the same.</Paragraph> <Paragraph position="11"> Although the dimension of the feature space is exponential, tree kernels can be calculated in O(|T1||T2|) time using dynamic programming (DP) procedures (Collins and Duffy, 2001; Kashima and Koyanagi, 2002). The MOLT kernel also has an O(|T1||T2|) DP procedure (Kazama and Torisawa, 2005).</Paragraph> </Section> <Section position="3" start_page="55" end_page="56" type="sub_section"> <SectionTitle> 3.3 WMOLT kernel </SectionTitle> <Paragraph position="0"> Although Kazama and Torisawa (2005) evaluated the MOLT kernel for SRL, the evaluation was only on the role assignment task and was preliminary.
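The weighting of Eq. (1) above can be stated compactly. This is a minimal sketch; passing the subtree size and mark count in directly is a simplification for illustration:

```python
def molt_weight(size: int, num_marks: int, lam: float = 0.5) -> float:
    # Eq. (1): W(S) = lam^|S| if S contains at least one mark, else 0
    return lam ** size if num_marks > 0 else 0.0
```

Larger subtrees decay geometrically in lambda, and unmarked subtrees contribute nothing to the kernel.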
We evaluated the MOLT kernel for argument recognition and found that it cannot achieve a high accuracy for argument recognition.</Paragraph> <Paragraph position="1"> The problem is that the MOLT kernel treats subtrees with one mark and subtrees with two marks equally, although the latter seem to be more important in distinguishing difficult arguments.</Paragraph> <Paragraph position="2"> Consider the sentence "He said industry should build plants". For "say", we have the following labeling. [A0 He] [V said] [A1 industry should build plants] On the other hand, for "build", we have He said [A0 industry] [AM-MOD should] [V build] [A1 plants].</Paragraph> <Paragraph position="3"> As can be seen, "he" is the A0 argument of "say", but not an argument of "build". Thus, our classifier should return +1 for the tree where "he" is marked when the predicate is "say", and -1 when the predicate is "build". Although the subtrees around the nodes for "say" and "build" are different, the subtrees around the node for "he" are identical in both cases. If "he" is often the A0 argument in the corpus, the classifier is likely to return +1 even for "build".</Paragraph> <Paragraph position="4"> Although the subtrees containing both the predicate and the argument nodes are considered in the MOLT kernel, they are given relatively small weights by Eq. (1), since such subtrees are large.</Paragraph> <Paragraph position="6"> Thus, we modify the MOLT kernel so that each mark can be weighted according to its importance and so that the more marks a subtree contains, the more weight it receives. The modification is simple. We change the definition of W(Si) as follows.</Paragraph> <Paragraph position="8"> W(Si) = λ^|Si| · Π_{ni ∈ Si} g(m(ni)) if marked(Si) = true, and W(Si) = 0 otherwise, where g(m) (≥ 1) is the weight of mark m. We call the kernel with this weight the WMOLT kernel.</Paragraph> <Paragraph position="9"> In this study, we assume g(no-mark) = 1 and g(*0) = g(*1) = g, so that the weight becomes</Paragraph> <Paragraph position="11"> W(Si) = λ^|Si| · g^{#m(Si)} for marked subtrees, where #m(Si) is the number of marked nodes in Si.
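To make the effect of mark weighting concrete, here is a toy, self-contained kernel in this spirit. It is an illustrative sketch, not the paper's algorithm: it uses a Collins–Duffy-style all-or-nothing expansion of child sequences instead of the elastic inclusion of Definition 1, and naive recursion over node pairs instead of the O(|T1||T2|) DP. Subtrees are weighted λ^|S| · g^#marks, and only subtrees containing at least one mark contribute.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MNode:
    label: str
    mark: Optional[str] = None           # e.g. "*0" (predicate) / "*1" (candidate)
    children: List["MNode"] = field(default_factory=list)

def _subtree_sum(n1, n2, lam, g, unmarked_only):
    # sum of weights over common subtrees rooted at the pair (n1, n2);
    # with unmarked_only=True, subtrees containing any mark are excluded
    if n1.label != n2.label or n1.mark != n2.mark:
        return 0.0
    if unmarked_only and n1.mark is not None:
        return 0.0
    base = lam * (g if n1.mark is not None else 1.0)   # this node's factor
    if n1.children and len(n1.children) == len(n2.children):
        prod = 1.0
        for c1, c2 in zip(n1.children, n2.children):
            prod *= _subtree_sum(c1, c2, lam, g, unmarked_only)
        return base * (1.0 + prod)       # stop here, or expand the production
    return base

def _nodes(t):
    yield t
    for c in t.children:
        yield from _nodes(c)

def wmolt_kernel(t1, t2, lam=0.5, g=2.0):
    # keep only subtrees with >= 1 mark: (all subtrees) - (unmarked subtrees)
    return sum(_subtree_sum(n1, n2, lam, g, False) -
               _subtree_sum(n1, n2, lam, g, True)
               for n1 in _nodes(t1) for n2 in _nodes(t2))
```

Raising `g` boosts every marked subtree by a factor of g per mark, so subtrees connecting the predicate and candidate marks (two marks) are favored over one-mark subtrees, which is exactly the motivation above.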
The last row in Table 2 shows how the subtree weights change when this mark weighting is introduced.</Paragraph> <Paragraph position="12"> For the WMOLT kernel, we can derive an O(|T1||T2|) DP procedure by slightly modifying the procedure presented by Kazama and Torisawa (2005). The method for speeding up training described in Kazama and Torisawa (2005) can also be applied with a slight modification.</Paragraph> <Paragraph position="14"> We describe this DP procedure in some detail.</Paragraph> <Paragraph position="15"> The key is the use of two DP matrices of size |T1| × |T2|. The first is C(n1, n2), defined as:</Paragraph> <Paragraph position="17"> C(n1, n2) = Σi W'(Si) · #Si(T1, n1) · #Si(T2, n2), where #Si(Tj, nk) is the number of times subtree Si is included in tree Tj with ψ(root(Si)) = nk. W'(Si) is defined as W'(Si) = λ^|Si| · g^{#m(Si)}.</Paragraph> <Paragraph position="18"> This means that this matrix records values that ignore whether marked(Si) = true or not. The second matrix, Cm(n1, n2), is defined in the same way but restricted to subtrees with marked(Si) = true.</Paragraph> <Paragraph position="20"> With these matrices, the kernel is calculated as K(T1, T2) = Σ_{n1 ∈ T1} Σ_{n2 ∈ T2} Cm(n1, n2).</Paragraph> <Paragraph position="22"> These matrices are calculated recursively, starting from the leaves of the trees. The recursive procedure is shown in Algorithm 3.1. See also Table 1 for the meaning of the functions used.</Paragraph> </Section> </Section> <Section position="6" start_page="56" end_page="57" type="metho"> <SectionTitle> 4 Fast Argument Recognition </SectionTitle> <Paragraph position="0"> We use SVMs as the classifiers for argument recognition in this study and describe a fast classification method based on SVMs.3 We denote a marked ordered labeled tree in which node nk of an ordered labeled tree U is marked by mark X, nl by Y, and so on, by U@{nk = X, nl = Y, ...}.</Paragraph> <Paragraph position="2"> [Fragment of Algorithm 4.1: iterate only over nodes n2 with l(n1) = l(n2), and update only when change(n2) is true.]</Paragraph> <Paragraph position="4"> [Fragment of Algorithm 4.1: for nk ← 1 to |U| (nk ≠ nv), compute diff ← FAST-UPDATE(nk) and update t(nk) with diff.] Given a sentence represented by tree U and the node nv of the target predicate, argument recognition requires the calculation of s(nk) = Σ_{Tj ∈ SV} aj yj K(U@{nv = *0, nk = *1}, Tj) + b (2) for all nk ∈
U (≠ nv), where SV represents the set of support vectors. Naively, this requires O(|U| × |SV| × |U||Tj|) time, which is rather costly in practice. However, if we exploit the fact that U@{nv = *0, nk = *1} differs from U@{nv = *0} at only one node, we can greatly speed up the above calculation. First, we calculate K(U@{nv = *0}, Tj) using the DP procedure presented in the previous section, and then calculate K(U@{nv = *0, nk = *1}, Tj) using a more efficient DP that updates only the necessary DP cells of the first DP. More specifically, we only need to update the DP cells involving the ancestor nodes of nk.</Paragraph> <Paragraph position="5"> Here we show the procedure for calculating t(nk) = K(U@{nv = *0, nk = *1}, Tj) for all nk for a given support vector Tj, which suffices for calculating s(nk). Algorithm 4.1 shows the procedure. For each nk, this procedure updates at most (nk's depth) × |Tj| cells, which is much smaller than |U| × |Tj| cells. In addition, when updating the cells for (n1, n2), we only need to update them when the cells for some child of n2 were updated during the calculation of the cells for the children of n1. To achieve this, change(n2) in the algorithm stores whether the cells of any child of n2 have been updated. This technique further reduces the number of updated cells.</Paragraph> </Section> <Section position="7" start_page="57" end_page="57" type="metho"> <SectionTitle> 5 Non-overlapping Constraint </SectionTitle> <Paragraph position="0"> Finally, in argument recognition, there is a strong constraint that the arguments of a given predicate do not overlap each other. To enforce this constraint, we employ the approach presented by Toutanova et al. (2005). Given the local classification probabilities p(nk = Xk) (Xk ∈ {ARG, NO-ARG}), this method finds the assignment that maximizes Π_k p(nk = Xk) while satisfying the non-overlapping constraint, by using a dynamic programming procedure.
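The constrained decoding step above can be sketched as a tree DP. This is a simplified reconstruction of the idea, not Toutanova et al.'s exact algorithm: a node may be ARG only if none of its descendants is ARG (so argument phrases cannot overlap), and we maximize the product of local probabilities:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cand:
    p_arg: float                       # local probability p(n = ARG)
    children: List["Cand"] = field(default_factory=list)

def _all_no_arg(n):
    # probability that n and every descendant are labeled NO-ARG
    p = 1.0 - n.p_arg
    for c in n.children:
        p *= _all_no_arg(c)
    return p

def best_score(n):
    # max product over non-overlapping assignments in n's subtree:
    # either n is ARG (all descendants forced to NO-ARG),
    # or n is NO-ARG and each child subtree is solved independently
    take = n.p_arg
    for c in n.children:
        take *= _all_no_arg(c)
    skip = 1.0 - n.p_arg
    for c in n.children:
        skip *= best_score(c)
    return max(take, skip)
```

Recovering the actual ARG/NO-ARG assignment additionally requires remembering which branch of the `max` won at each node and tracing it top-down; the sketch returns only the optimal score.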
Since the output of an SVM is not a probability, in this study we obtain a probability value by converting the SVM output s(nk) with the sigmoid function:4</Paragraph> <Paragraph position="2"> p(nk = ARG) = 1 / (1 + exp(-s(nk))).</Paragraph> </Section> <Section position="8" start_page="57" end_page="58" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="57" end_page="57" type="sub_section"> <SectionTitle> 6.1 Setting </SectionTitle> <Paragraph position="0"> We used the training part of the CoNLL 2005 shared task dataset and divided it into our training, development, and test sets (23,899, 7,966, and 7,967 sentences, respectively). We used the outputs of the Charniak parser provided with the dataset. We also used the POS tags, which were also provided, by inserting nodes labeled with POS tags above the word nodes. The words were downcased.</Paragraph> <Paragraph position="1"> We used TinySVM (chasen.org/taku/software/TinySVM) as the implementation of the SVMs, adding the WMOLT kernel. We normalized the kernel as K(Ti, Tj) / √(K(Ti, Ti) × K(Tj, Tj)).</Paragraph> <Paragraph position="2"> To train the classifiers, we used as positive examples the marked ordered labeled trees that encode the arguments in the training set. Although nodes other than the argument nodes are potentially negative examples, we used a randomly sampled 1/5 of these nodes, since the number of such nodes is so large that training could not otherwise be performed in practice. Note that we ignored arguments that do not match any node in the tree (the rate of such arguments was about 3.5% in the training set).</Paragraph> </Section> <Section position="2" start_page="57" end_page="58" type="sub_section"> <SectionTitle> 6.2 Effect of mark weighting </SectionTitle> <Paragraph position="0"> We first evaluated the effect of the mark weighting of the WMOLT kernel. For several fixed values of g, we tuned λ and the soft-margin constant of the SVM, C, and evaluated the recognition accuracy. We tested 30 different values of C ∈ [0.1 ... 500] for each λ ∈
[0.05, 0.1, 0.15, 0.2, 0.25, 0.3]. The tuning was performed using the method for speeding up training with tree kernels described by Kazama and Torisawa (2005). We conducted the above experiment for several training sizes.</Paragraph> <Paragraph position="1"> Table 3 shows the results: the best setting of λ and C, the performance on the development set with the best setting, and the performance on the test set. The performance is shown in the F1 measure. Note that we treated the regions labeled C-k in the CoNLL 2005 dataset as independent arguments.</Paragraph> <Paragraph position="2"> We can see that the mark weighting greatly improves the accuracy over the original MOLT kernel (i.e., g = 1). In addition, we can see that the best setting for g is somewhere around g = 4,000. In this experiment, we could only test up to 1,000 sentences due to the cost of SVM training, which was empirically O(L^2), where L is the number of training examples, regardless of the use of the speed-up method (Kazama and Torisawa, 2005). However, we can observe that the WMOLT kernel achieves a high accuracy even when the training data is very small.</Paragraph> </Section> <Section position="3" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 6.3 Effect of non-overlapping constraint </SectionTitle> <Paragraph position="0"> Additionally, we observed how the accuracy changes when we do not use the method described in Section 5 and instead simply consider a node to be an argument when s(nk) > 0. The last row in Table 3 shows the accuracy for the model obtained with g = 4,000. We observed that the non-overlapping constraint also improves the accuracy.</Paragraph> </Section> <Section position="4" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 6.4 Recognition speed-up </SectionTitle> <Paragraph position="0"> Next, we examined the method for fast argument recognition described in Section 4.
Using the classifiers with g = 4,000, we measured the time required to recognize the arguments of 200 sentences with the naive classification of Eq. (2) and with the fast update procedure shown in Algorithm 4.1. The time was measured on a computer with 2.2-GHz dual-core Opterons and 8 GB of RAM.</Paragraph> <Paragraph position="1"> Table 4 shows the results. We can see a consistent speed-up by a factor of more than 40, although the time increased for both methods as the size of the training data increased (due to the increase in the number of support vectors).</Paragraph> </Section> <Section position="5" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 6.5 Evaluation on CoNLL 2005 evaluation set </SectionTitle> <Paragraph position="0"> To compare the performance of our system with other systems, we conducted an evaluation on the official evaluation set of the CoNLL 2005 shared task. We used a model trained on 2,000 sentences (57,547 examples) with (g = 4,000, λ = 0.2, C = 12.04), the best setting in the previous experiments. This is the largest model we have successfully trained so far, and it has F1 = 88.00 on the test set in the previous experiments.</Paragraph> <Paragraph position="1"> The accuracy of this model on the official evaluation set was F1 = 79.96 using the criterion from the previous experiments, where we treated a C-k argument as an independent argument. The official evaluation script returned F1 = 78.22. This difference arises because the official script takes C-k arguments into consideration, while our system cannot output C-k labels, since it is just an argument recognizer. Therefore, the performance would be slightly higher than F1 = 78.22 if we performed the role assignment step.
However, our current system is in any case worse than the systems reported in the CoNLL 2005 shared task, whose reported argument recognition accuracies ranged from F1 = 79.92 to 83.78 (Carreras and Màrquez, 2005).</Paragraph> </Section> </Section> <Section position="9" start_page="58" end_page="59" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> Although we have improved the accuracy by introducing the WMOLT kernel, the accuracy on the official evaluation set was not satisfactory. One possible reason is the accuracy of the parser. Since the Charniak parser is trained on the same data as the training set of the CoNLL 2005 shared task, its parsing accuracy is worse on the official evaluation set than on the training set. For example, the rate of arguments that do not match any node of the parse tree is 3.93% for the training set, but 8.16% for the evaluation set. This, to some extent, explains why our system, which achieved F1 = 88.00 on our test set, could only achieve F1 = 79.96. To achieve a higher accuracy, we need to make the system more robust to parsing errors. Some of the non-matching arguments are caused by incorrect treatment of quotation marks and commas. These errors could likely be fixed with simple pre-processing. Other major non-matching arguments are caused by PP-attachment errors. To solve these errors, we need to explore further techniques, such as using n-best parses and several syntactic views (Pradhan et al., 2005b).</Paragraph> <Paragraph position="1"> Another reason for the low accuracy is the size of the training data. In this study, we could train the SVM with 2,000 sentences (this took more than 30 hours, including the conversion of trees), but this is a very small fraction of the entire training set. We need to explore methods for incorporating a large training set within a reasonable training time.
For example, the combination of small SVMs (Shen et al., 2003) is a possible direction.</Paragraph> <Paragraph position="2"> The contribution of this study is not the accuracy achieved. The first contribution is the demonstration of the drastic effect of the mark weighting. We will explore more accurate kernels based on the WMOLT kernel. For example, we are planning to use different weights depending on the marks. The second contribution is the method for speeding up argument recognition. This is of great importance, since the proposed method can be applied to other tasks where all nodes in a tree should be classified. In addition, this method became possible because of the WMOLT kernel, and it is hard to apply to the method of Moschitti and Bejan (2004), where the tree structure changes during recognition. Thus, an architecture that uses the WMOLT kernel is promising, if we assume further progress is possible with the kernel design.</Paragraph> </Section> </Paper>