<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2909"> <Title>Semantic Role Labeling via Tree Kernel Joint Inference</Title> <Section position="4" start_page="61" end_page="64" type="metho"> <SectionTitle> 2 Semantic Role Labeling </SectionTitle> <Paragraph position="0"> In the last years, several machine learning approaches have been developed for automatic role labeling, e.g. (Gildea and Jurasfky, 2002; Pradhan et al., 2005a). Their common characteristic is the adoption of attribute-value representations for predicate-argument structures. Accordingly, our basic system is similar to the one proposed in (Pradhan et al., 2005a) and it is hereby described.</Paragraph> <Paragraph position="1"> We use a boundary detection classifier (for any role type) to derive the words compounding an argument and a multiclassifier to assign the roles (e.g. Arg0 or ArgM) described in PropBank (Kingsbury and Palmer, 2002)). To prepare the training data for both classifiers, we used the following algorithm: 1. Given a sentence from the training-set, generate a full syntactic parse tree; 2. Let P and A be respectively the set of predicates and the set of parse-tree nodes (i.e. the potential arguments); null 3. For each pair <p,a> [?]P xA: - extract the feature representation set, Fp,a; - if the subtree rooted in a covers exactly the words of one argument of p, put Fp,a in the T+ set (positive examples), otherwise put it in the T[?] set (negative examples).</Paragraph> <Paragraph position="2"> The outputs of the above algorithm are the T+ and T[?] sets. These sets can be directly used to train a boundary classifier (e.g. an SVM). Regarding the argument type classifier, a binary labeler for a role r (e.g. an SVM) can be trained on the T+r , i.e. its positive examples and T[?]r , i.e. its negative examples, where T+ = T+r [?] T[?]r , according to the ONE-vs-ALL scheme. The binary classifiers are then used to build a general role multiclassifier by simply selecting the argument associated with the maximum among the SVM scores.</Paragraph> <Paragraph position="3"> Regarding the design of features for predicate-argument pairs, we can use the attribute-values defined in (Gildea and Jurasfky, 2002) or tree structures (Moschitti, 2004). Although we focus on the latter approach, a short description of the former is still relevant as they are used by TBC and TRC. They include the Phrase Type, Predicate Word, Head Word, Governing Category, Position and Voice features. For example, the Phrase Type indicates the syntactic type of the phrase labeled as a predicate argument and the Parse Tree Path contains the path in the parse tree between the predicate and the argument phrase, expressed as a sequence of nonterminal labels linked by direction (up or down) symbols, e.g. V|VP|NP.</Paragraph> <Paragraph position="4"> A viable alternative to manual design of syntactic features is the use of tree-kernel functions. These implicitly define a feature space based on all possible tree substructures. Given two trees T1 and T2, instead of representing them with the whole fragment space, we can apply the kernel function to evaluate the number of common fragments.</Paragraph> <Paragraph position="5"> Formally, given a tree fragment space F = {f1,f2,...,f|F|}, the indicator function Ii(n) is equal to 1 if the target fi is rooted at node n and equal to 0 otherwise. A tree-kernel function over t1 and t2 is Kt(t1,t2) =summationtext ments. When l = 1, [?] is equal to the number of common fragments rooted at nodes n1 and n2. 
<Paragraph position="6"> 3 Tree kernel-based classification of whole predicate argument structures

Traditional semantic role labeling systems extract features from pairs of nodes corresponding to a predicate and one of its arguments. Thus, they consider only binary relations when making classification decisions. This information is poorer than that expressed by the whole predicate argument structure. As an alternative, we can select the set of potential arguments (potential argument nodes) of a predicate and extract features from it. The number of candidate argument sets is exponential, thus we should consider only those corresponding to the most probable correct argument structures.</Paragraph> <Paragraph position="7"> The usual approach (Toutanova et al., 2005) uses a traditional boundary classifier (TBC) to select the set of potential argument nodes. Such a set can be associated with a subtree, which in turn can be classified by means of a tree kernel function. This function intuitively measures to what extent a given candidate subtree is compatible with the subtree of a correct predicate argument structure. We can use it to define two different learning problems: (a) the simple classification of correct and incorrect predicate argument structures and (b) given the best m structures, the training of a re-ranking algorithm able to exploit argument inter-dependencies.</Paragraph> <Section position="1" start_page="61" end_page="63" type="sub_section"> <SectionTitle> 3.1 The Argument Spanning Trees (ASTs) </SectionTitle> <Paragraph position="0"> We consider predicate argument structures annotated in PropBank, along with the corresponding TreeBank data, as our object space. Given the target predicate node p and a node subset s = {n_1,...,n_k} of the parse tree t, we define the spanning tree root r as the lowest common ancestor of n_1,...,n_k and p. The node set spanning tree (NST) p_s is the subtree of t rooted in r from which the nodes that are neither ancestors nor descendants of any n_i or p are removed.</Paragraph> <Paragraph position="1"> Since predicate arguments are associated with tree nodes (i.e. they exactly fit into syntactic constituents), we can define the Argument Spanning Tree (AST) of a predicate argument set {p, {a_1,...,a_n}} as the NST over such nodes, i.e. p_{a_1,...,a_n}. An AST corresponds to the minimal subtree whose leaves are all and only the words composing the arguments and the predicate. For example, Figure 1 shows the parse tree of the sentence "John took the book and read its title". took_{Arg0,Arg1} and read_{Arg0,Arg1} are the two AST structures associated with the predicates took and read, respectively. All the other possible subtrees, i.e. NSTs, are not valid ASTs for these two predicates. Note that classifying p_s as AST or NST for each node subset s of t is equivalent to solving the boundary detection problem. The NST construction is sketched below.</Paragraph>
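A minimal sketch of the NST/AST construction just defined, assuming a simple Node class with parent pointers; the class layout and function names are illustrative assumptions, not the authors' implementation.

```python
class Node:
    def __init__(self, label, children=()):
        self.label, self.children, self.parent = label, list(children), None
        for child in self.children:
            child.parent = self

def ancestors(n):
    """The chain from n up to the root, n included."""
    chain = set()
    while n is not None:
        chain.add(n)
        n = n.parent
    return chain

def spanning_tree(predicate, arg_nodes):
    """Build the NST p_s: root it at the lowest common ancestor of the
    predicate node and the candidate argument nodes, then drop every node
    that is neither an ancestor nor a descendant of p or of some n_i."""
    selected = set(arg_nodes) | {predicate}
    common = set.intersection(*(ancestors(n) for n in selected))
    root = max(common, key=lambda n: len(ancestors(n)))  # deepest common ancestor

    def keep(n):
        # ancestor of a selected node, or (since ancestors(n) contains n)
        # a selected node itself or a descendant of one
        return any(n in ancestors(s) for s in selected) or bool(ancestors(n) & selected)

    def copy(n):
        return Node(n.label, [copy(c) for c in n.children if keep(c)])

    return copy(root)
```

Applying spanning_tree to a predicate node and its annotated PropBank argument nodes yields the AST; any other node subset yields a generic NST.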
<Paragraph position="2"> The critical points for AST classification are: (1) how to design suitable features for the characterization of valid structures, which requires a careful linguistic investigation of their significant properties; and (2) how to deal with the exponential number of NSTs.</Paragraph> <Paragraph position="3"> The first problem can be addressed by means of tree kernels over the ASTs. Tree kernel spaces are an alternative to manual feature design, as the learning machine (e.g. an SVM) can select the most relevant features from a high-dimensional space. In other words, we can use a tree kernel function to estimate the similarity between two ASTs (see Section 2), thus avoiding the definition of explicit features. The second problem can be approached in two ways: (1) We can increase the recall of TBC to enlarge the set of candidate arguments. From such a set, we can extract correct and incorrect argument structures. As the number of such structures will be rather small, we can apply the AST classifier to detect the correct ones. (2) We can consider the classification probabilities provided by TBC and TRC (Pradhan et al., 2005a) and select the m most probable structures. Then, we can apply a re-ranking approach based on SVMs and tree kernels.</Paragraph> <Paragraph position="4"> The re-ranking approach is the most promising one, as suggested in (Toutanova et al., 2005), but it does not clearly reveal whether tree kernels can be used to learn the difference between correct and incorrect argument structures. Thus it is interesting to study both of the above approaches.</Paragraph> </Section> <Section position="2" start_page="63" end_page="63" type="sub_section"> <SectionTitle> 3.2 NST Classification </SectionTitle> <Paragraph position="0"> As we cannot classify all possible candidate argument structures, we apply the AST classifier only to detect the correct structures from a set of overlapping arguments. Two nodes n_1 and n_2 of an NST overlap if either n_1 is an ancestor of n_2 or vice versa. NSTs that contain overlapping nodes are not valid ASTs, but subtrees of NSTs may be valid ASTs. Accordingly, we define s as the set of potential argument nodes and create two node sets s_1 = s − {n_1} and s_2 = s − {n_2}. By classifying the two new NSTs p_{s_1} and p_{s_2} with the AST classifier, we can select the correct structures. Of course, this procedure can be generalized to sets of more than two overlapping nodes. However, considering that the Precision of TBC is generally high, the number of overlapping nodes is usually small.</Paragraph> <Paragraph position="1"> Figure 2 shows a working example of the multi-stage classifier. In Frame (a), TBC labels as potential arguments (circled nodes) three overlapping nodes related to Arg1. This leads to two possible non-overlapping solutions (Frame (b)), but only the first one is correct. In fact, according to the second one, the prepositional phrase "of the book" would be incorrectly attached to the verbal predicate, i.e. in conflict with the parse tree. The AST classifier, applied to the two NSTs, is expected to detect this inconsistency and provide the correct output. The selection procedure is sketched below.</Paragraph>
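The following sketch, reusing Node, ancestors and spanning_tree from the earlier sketch, shows the overlap-resolution step. The greedy loop over more than two overlapping nodes and the ast_score callable (a stand-in for the trained AST classifier) are our own simplifying assumptions.

```python
def overlap(n1, n2):
    """Two nodes overlap when one dominates the other."""
    return n1 in ancestors(n2) or n2 in ancestors(n1)

def resolve_overlaps(predicate, candidates, ast_score):
    """While two candidate argument nodes overlap, build the two competing
    NSTs p_{s1} and p_{s2} (s minus one of the overlapping nodes) and keep
    the alternative the AST classifier prefers."""
    s = set(candidates)
    while True:
        pair = next(((a, b) for a in s for b in s
                     if a is not b and overlap(a, b)), None)
        if pair is None:
            return s          # no overlaps left: a valid candidate set
        n1, n2 = pair
        s1, s2 = s - {n1}, s - {n2}
        if ast_score(spanning_tree(predicate, s1)) >= ast_score(spanning_tree(predicate, s2)):
            s = s1
        else:
            s = s2
```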
</Section> <Section position="3" start_page="63" end_page="64" type="sub_section"> <SectionTitle> 3.3 Re-ranking NSTs with Tree Kernels </SectionTitle> <Paragraph position="0"> To implement the re-ranking model, we follow the approach described in (Toutanova et al., 2005).</Paragraph> <Paragraph position="1"> First, we use SVMs to implement the boundary (TBC) and role (TRC) local classifiers. As SVMs do not provide probabilistic output, we use Platt's algorithm (Platt, 2000) and its revised version (Lin et al., 2003) to transform scores into probabilities.</Paragraph> <Paragraph position="2"> Second, we combine the TBC and TRC probabilities to obtain the m most likely sequences s of tree nodes annotated with semantic roles. As argument constituents of the same verb cannot overlap, we generate only sequences that respect this node constraint. We adopt the same algorithm described in (Toutanova et al., 2005): starting from the leaves, we select the m sequences that respect the constraints and, at the same time, have the highest joint TBC and TRC probability.</Paragraph> <Paragraph position="3"> Third, we extract the following feature representations: (a) the ASTs associated with the predicate argument structures. To speed up the learning process and capture only the most relevant features, we also experimented with a compact version of the AST, pruned at the level of the argument nodes. (b) Attribute-value features (standard features) related to the whole predicate structure. These include the features for each argument (Gildea and Jurafsky, 2002) and global features like the sequence of argument labels, e.g. ⟨Arg0, Arg1, ArgM⟩.</Paragraph> <Paragraph position="4"> Finally, we prepare the training examples for the re-ranker considering the m best annotations of each predicate structure. We use the approach adopted in (Shen et al., 2003), which generates all possible pairs from the m examples, i.e. $\binom{m}{2}$ pairs. A pair is a positive example if its first member has a higher score than its second member. The score that we use is the F1 measure of the annotated structure with respect to the gold standard. In more detail, given training/testing examples e_i = ⟨t_i^1, t_i^2, v_i^1, v_i^2⟩, where t_i^1 and t_i^2 are two ASTs and v_i^1 and v_i^2 are two feature vectors associated with two candidate predicate structures s_1 and s_2, we define the following kernels:
$$K_{tr}(e_1, e_2) = K_t(t_1^1, t_2^1) + K_t(t_1^2, t_2^2) - K_t(t_1^1, t_2^2) - K_t(t_1^2, t_2^1),$$
where t_i^j is the j-th AST of the pair e_i, K_t is the tree kernel function defined in Section 2 and i, j ∈ {1, 2};
$$K_{pr}(e_1, e_2) = K_p(v_1^1, v_2^1) + K_p(v_1^2, v_2^2) - K_p(v_1^1, v_2^2) - K_p(v_1^2, v_2^1),$$
where v_i^j is the j-th feature vector of the pair e_i and K_p is the polynomial kernel applied to such vectors. The final kernel that we use for re-ranking combines the two:
$$K(e_1, e_2) = K_{tr}(e_1, e_2) + K_{pr}(e_1, e_2).$$
Regarding tree kernel feature engineering, the next section shows how we can generate more effective features given an established kernel function. A sketch of the pair generation and of the preference kernel follows.</Paragraph>
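A sketch of the pair construction and of the preference kernel, assuming each candidate annotation is a (tree, vector, f1) triple; tree_kernel is the function sketched in Section 2 (so trees use the same tuple encoding), and the polynomial degree is an illustrative choice rather than the authors' setting.

```python
import numpy as np
from itertools import combinations

def poly_kernel(v1, v2, degree=3, c=1.0):
    return (np.dot(v1, v2) + c) ** degree

def make_pairs(annotations):
    """Shen et al. (2003)-style pair generation over the m best annotations:
    a pair (a, b) is positive when a has the higher F1 against the gold standard."""
    pairs = []
    for (t_a, v_a, f_a), (t_b, v_b, f_b) in combinations(annotations, 2):
        if f_a != f_b:                  # ties carry no preference information
            pairs.append(((t_a, t_b, v_a, v_b), +1 if f_a > f_b else -1))
    return pairs

def rerank_kernel(e1, e2, lam=0.4):
    """K(e1, e2) = K_tr(e1, e2) + K_pr(e1, e2) over examples e = (t^1, t^2, v^1, v^2)."""
    t11, t21, v11, v21 = e1             # members of pair e1
    t12, t22, v12, v22 = e2             # members of pair e2
    k_tr = (tree_kernel(t11, t12, lam) + tree_kernel(t21, t22, lam)
            - tree_kernel(t11, t22, lam) - tree_kernel(t21, t12, lam))
    k_pr = (poly_kernel(v11, v12) + poly_kernel(v21, v22)
            - poly_kernel(v11, v22) - poly_kernel(v21, v12))
    return k_tr + k_pr
```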
</Section> <Section position="4" start_page="64" end_page="64" type="sub_section"> <SectionTitle> 3.4 Tree kernel feature engineering </SectionTitle> <Paragraph position="0"> Consider Frame (b) of Figure 2: it shows two perfectly identical NSTs; consequently, their fragments will also be identical. This prevents the algorithm from learning anything from such examples. To solve the problem, we can enrich the NSTs by marking their argument nodes with a progressive number, starting from the leftmost argument. For example, in the first NST of Frame (c), we mark the first and second argument nodes as NP-0 and NP-1, whereas in the second NST we transform the three argument node labels into NP-0, NP-1 and PP-2. We refer to the resulting structure as an AST-Ord (ordinal number).</Paragraph> <Paragraph position="1"> This simple modification allows the tree kernel to generate different argument structures for the above NSTs. For example, from the first NST in Figure 2.c, the fragments [NP-1 [NP][PP]], [NP [DT][NN]] and [PP [IN][NP]] are generated. They no longer match the [NP-0 [NP][PP]], [NP-1 [DT][NN]] and [PP-2 [IN][NP]] fragments generated from the second NST in Figure 2.c.</Paragraph> <Paragraph position="2"> Additionally, it should be noted that the semantic information provided by the role type can remarkably help the detection of correct and incorrect predicate argument structures. Thus, we can enrich the argument node label with the role type, e.g. the NP-0 and NP-1 of the correct AST of Figure 2.c become NP-Arg0 and NP-Arg1 (not shown in the figure). We refer to this structure as AST-Arg. Of course, to apply the AST-Arg classifier, we need TRC to label the arguments detected by TBC. The relabeling is sketched below.</Paragraph>
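A sketch of the relabeling, again using the Node class from the Section 3.1 sketch; treating the roles argument as an optional node-to-role mapping is our own simplification.

```python
def leaves(n):
    return [n] if not n.children else [leaf for c in n.children for leaf in leaves(c)]

def mark_arguments(ast_root, arg_nodes, roles=None):
    """AST-Ord: append a progressive number to each argument node label,
    leftmost argument first. AST-Arg: append the TRC role instead, when a
    node -> role mapping (e.g. {node: 'Arg0'}) is supplied."""
    position = {id(leaf): i for i, leaf in enumerate(leaves(ast_root))}
    ordered = sorted(arg_nodes, key=lambda n: position[id(leaves(n)[0])])
    for i, node in enumerate(ordered):
        node.label += "-" + (roles[node] if roles else str(i))  # e.g. NP-0 or NP-Arg0
    return ast_root
```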
</Section> </Section> <Section position="5" start_page="64" end_page="66" type="metho"> <SectionTitle> 4 The experiments </SectionTitle> <Paragraph position="0"> The experiments were carried out within the setting defined in the CoNLL-2005 shared task (Carreras and Màrquez, 2005). In particular, we adopted the Charniak parse trees available at www.lsi.upc.edu/~srlconll/ along with the official performance evaluator.</Paragraph> <Paragraph position="1"> All the experiments were performed with the SVM-light-TK software available at http://ai-nlp.info.uniroma2.it/moschitti/, which encodes the ST and SST kernels in SVM-light (Joachims, 1999). For TBC and TRC, we used the linear kernel with a regularization parameter (option -c) equal to 1. A cost factor (option -j) of 10 was adopted for TBC to obtain a higher Recall, whereas for TRC the cost factor was parameterized according to the maximal accuracy of each argument class on the validation set. For the AST-based classifiers we used λ = 0.4 (see (Moschitti, 2004)).</Paragraph> <Section position="1" start_page="65" end_page="66" type="sub_section"> <SectionTitle> 4.1 Classification of whole predicate argument structures </SectionTitle> <Paragraph position="0"> In these experiments, we trained TBC on sections 02-08, whereas, to obtain a very accurate role classifier, we trained TRC on all of sections 02-21. To train the AST, AST-Ord (AST with ordinal numbers on the argument nodes) and AST-Arg (AST with argument types on the argument nodes) classifiers, we applied TBC and TRC to sections 09-20. Then, we considered all the structures whose automatic annotation showed at least one argument overlap. From these, we extracted 30,220 valid ASTs and 28,143 non-valid ASTs, for a total of 183,642 arguments.</Paragraph> <Paragraph position="1"> First, we evaluated the accuracy of the AST-based classifiers by extracting 1,975 ASTs and 2,220 non-ASTs from Section 21 and 2,159 ASTs and 3,461 non-ASTs from Section 23. The accuracy derived on Section 21 is an upper bound for our classifiers, since it is obtained using an ideal syntactic parser (the Charniak parser was trained on Section 21 as well) and an ideal role classifier.</Paragraph> <Paragraph position="2"> Table 1 shows the Precision, Recall and F1 measures of the AST-based classifiers over the above NSTs. Rows 2, 3 and 4 report the performance of the AST, AST-Ord and AST-Arg classifiers, respectively. We note that: (a) the impact of parsing accuracy is shown by the gap of about 6 percentage points between Sections 21 and 23; (b) the ordinal numbering of arguments (Ord) and the role type information (Arg) provide tree kernels with more meaningful fragments, since they improve the basic model by about 4%; (c) the deeper semantic information generated by the Arg labels provides useful clues for selecting correct predicate argument structures, since it improves the Ord model on both sections.</Paragraph> <Paragraph position="3"> Second, we measured the impact of the AST-based classifiers on the accuracy of both phases of semantic role labeling. Table 2 reports the results on Sections 21 and 23. For each of them, the Precision, Recall and F1 of different approaches to boundary identification (bnd) and to the complete task, i.e. boundary and role classification (bnd+class), are shown. These approaches are based on different strategies for removing overlaps: the AST, AST-Ord and AST-Arg classifiers, and the baseline (RND), i.e. a random selection of non-overlapping structures. The baseline corresponds to the system based on TBC and TRC.[1]

[1] We needed to remove the overlaps from the baseline output in order to apply the CoNLL evaluator.</Paragraph> <Paragraph position="4"> We note that: (a) for every model, the boundary detection F1 on Section 21 is about 10 points higher than the F1 on Section 23 (e.g. 87.0% vs. 77.9% for RND); as expected, parse tree quality is very important for detecting argument boundaries. (b) On the real test set (Section 23), role classification introduces labeling errors which decrease the accuracy by about 5 points (77.9 vs. 72.9 for RND). (c) The Ord and Arg approaches consistently improve the baseline F1 by about 1 point. Such a small impact is not surprising, as the overlapping structures are a small percentage of the test set, so the overall improvement cannot be very high.</Paragraph> <Paragraph position="5"> Third, a comparison with the CoNLL 2005 results (Carreras and Màrquez, 2005) can only be carried out with respect to the whole SRL task (bnd+class in Table 2), since boundary detection versus role classification results are generally not provided in CoNLL 2005. Moreover, our best global result, i.e. 73.9%, was obtained under two severe experimental constraints: (a) the use of just 1/3 of the available training set, and (b) the use of the linear SVM model for the TBC classifier, which is much faster than polynomial SVMs but also less accurate. Nevertheless, we note the promising results of the AST metaclassifier, which can be combined with any of the best CoNLL systems.</Paragraph> <Paragraph position="6"> Finally, the overall results suggest that the tree kernel model is robust to parse tree errors, since it preserves the same improvement across trees of different accuracy, i.e. the semi-automatic trees of Section 21 and the automatic trees of Section 23. Moreover, it shows high accuracy in the classification of correct and incorrect ASTs. This last property is quite interesting, as the best SRL systems (Punyakanok et al., 2005; Toutanova et al., 2005; Pradhan et al., 2005b) were obtained by exploiting information about the whole predicate argument structure.</Paragraph> <Paragraph position="7"> The next section presents our preliminary experiments on re-ranking using the AST kernel-based approach.</Paragraph> </Section> <Section position="2" start_page="66" end_page="66" type="sub_section"> <SectionTitle> 4.2 Re-ranking based on Tree Kernels </SectionTitle> <Paragraph position="0"> In these experiments, we used the output of TBC and TRC[2] to provide the SVM tree kernel with a ranked list of predicate argument structures. In more detail, we applied a Viterbi-like algorithm (sketched below) to generate the 20 most likely annotations for each predicate structure, according to the joint probabilistic model of TBC and TRC.

[2] With the aim of improving on the state of the art, this time we applied the polynomial kernel for all basic classifiers. We used the models developed during our participation in the CoNLL 2005 shared task (Moschitti et al., 2005).</Paragraph>
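A simplified sketch of this joint annotation step: a beam over candidate nodes that keeps the m non-overlapping role assignments with the highest joint TBC × TRC probability. The flat iteration order and the data layout are our own simplifications of the bottom-up algorithm adopted from Toutanova et al. (2005); overlap is the predicate sketched in Section 3.2.

```python
from heapq import nlargest

def m_best_annotations(candidates, m=20):
    """candidates: list of (node, p_boundary, {role: p_role}) triples, with
    probabilities obtained from the Platt-scaled TBC and TRC scores.
    Returns the m highest-probability non-overlapping annotations."""
    beam = [([], 1.0)]                    # (assignments, joint probability)
    for node, p_arg, role_probs in candidates:
        extended = []
        for assignment, p in beam:
            # alternative 1: the node is not an argument
            extended.append((assignment, p * (1.0 - p_arg)))
            # alternative 2: label it with each role, if no overlap arises
            if all(not overlap(node, other) for other, _ in assignment):
                for role, p_role in role_probs.items():
                    extended.append((assignment + [(node, role)], p * p_arg * p_role))
        beam = nlargest(m, extended, key=lambda state: state[1])
    return beam
```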
<Paragraph position="1"> We sorted these structures by their F1 measure and used them to learn the SVM re-ranker described in Section 3.3.</Paragraph> <Paragraph position="2"> For training, we used Sections 12, 14, 15, 16 and 24, which contain 24,729 predicate structures. For each of them, we considered the 5 annotations having the highest F1 score (i.e. 123,674 NSTs) among the 20 best annotations provided by the Viterbi-like algorithm. With such structures, we obtained 294,296 pairs to train the SVM-based re-ranker. As this number is very large, SVM training time was very high. We therefore sped up the learning process by using only the ASTs associated with the core arguments. From the test sentences (which contain 5,267 structures), we extracted the 20 best Viterbi-annotated structures, i.e. 102,343 NSTs (for a total of 315,531 pairs), which were used for the following experiments.</Paragraph> <Paragraph position="3"> First, we selected the best annotation (according to the F1 computed against the gold standard annotations) out of the 20 provided by the Viterbi-like algorithm. The resulting F1 of 88.59% is the upper bound of our approach.</Paragraph> <Paragraph position="4"> Second, we selected the top-ranked annotation indicated by the Viterbi-like algorithm. This provides our baseline F1 measure, i.e. 75.91%. This outcome is slightly higher than our official CoNLL result (Moschitti et al., 2005), which was obtained without converting SVM scores into probabilities.</Paragraph> <Paragraph position="5"> Third, we applied the SVM re-ranker to select the best structures according to the core roles. We achieved 80.68%, which is practically equal to the result obtained in (Punyakanok et al., 2005; Carreras and Màrquez, 2005) for core roles, i.e. 81%. Their overall F1, which includes all the arguments, was 79.44%. This confirms that the classification of non-core roles is more complex than that of the other arguments.</Paragraph> <Paragraph position="6"> Finally, the high computation time of the re-ranker prevented us from using the larger structures that include all arguments. The major complexity issue was the slow training and classification time of SVMs. The time needed for the tree kernel function was not a problem, as we could use the fast evaluation proposed in (Moschitti, 2006), which roughly reduces the computation time to that required by a polynomial kernel. The real burden is therefore the learning time of SVMs, which is quadratic in the number of training instances. For example, carrying out the re-ranking experiments required approximately one month on a 64-bit machine (2.4 GHz, 4 GB of RAM). To solve this problem, we plan to study the impact on accuracy of fast learning algorithms such as the Voted Perceptron.</Paragraph> </Section> </Section> </Paper>