<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0407">
  <Title>Engineering of Syntactic Features for Shallow Semantic Parsing</Title>
  <Section position="3" start_page="48" end_page="50" type="metho">
    <SectionTitle>
2 Automated Semantic Role Labeling
</SectionTitle>
    <Paragraph position="0"> One of the largest resources of manually annotated predicate argument structures has been developed in the PropBank (PB) project. The PB corpus contains 300,000 words annotated with predicative information on top of the Penn Treebank 2 Wall Street Journal texts. For any given predicate, the expected arguments are labeled sequentially from Arg0 to Arg9, ArgA and ArgM. Figure 1 shows an example of the PB predicate annotation of the sentence: John rented a room in Boston.</Paragraph>
    <Paragraph position="1"> Predicates in PB are only embodied by verbs whereas most of the times Arg0 is the subject, Arg1 is the direct object and ArgM indicates locations, as in our example.</Paragraph>
    <Paragraph position="2">  Several machine learning approaches for automatic predicate argument extraction have been developed, e.g. (Gildea and Jurasfky, 2002; Gildea and Palmer, 2002; Gildea and Hockenmaier, 2003; Pradhan et al., 2004). Their common characteristic is the adoption of feature spaces that model predicate-argument structures in a flat feature representation. In the next section, we present the common parse tree-based approach to this problem.</Paragraph>
    <Section position="1" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
2.1 Predicate Argument Extraction
</SectionTitle>
      <Paragraph position="0"> Given a sentence in natural language, all the predicates associated with the verbs have to be identified along with their arguments. This problem is usually divided in two subtasks: (a) the detection of the target argument boundaries, i.e. the span of its words in the sentence, and (b) the classification of the argument type, e.g. Arg0 or ArgM in PropBank or Agent and Goal in FrameNet.</Paragraph>
      <Paragraph position="1"> The standard approach to learn both the detection and the classification of predicate arguments is summarized by the following steps:  1. Given a sentence from the training-set, generate a full syntactic parse-tree; 2. let P and A be the set of predicates and the set of parse-tree nodes (i.e. the potential arguments), respectively; 3. for each pair &lt; p,a &gt;[?]P xA: * extract the feature representation set, Fp,a;  * if the subtree rooted in a covers exactly the words of one argument of p, put Fp,a in T+ (positive examples), otherwise put it in T[?] (negative examples).</Paragraph>
      <Paragraph position="2"> For instance, in Figure 1, for each combination of the predicate rent with the nodes N, S, VP, V, NP, PP, D or IN the instances Frent,a are generated. In case the node a exactly covers &amp;quot;John&amp;quot;, &amp;quot;a room&amp;quot; or &amp;quot;in Boston&amp;quot;, it will be a positive instance otherwise it will be a negative one, e.g. Frent,IN.</Paragraph>
      <Paragraph position="3"> The T+ and T[?] sets are used to train the boundary classifier. To train the multi-class classifier T+ can be reorganized as positive T+argi and negative T[?]argi examples for each argument i. In this way, an individual ONE-vs-ALL classifier for each argument i can be trained. We adopted this solution, according to (Pradhan et al., 2004), since it is simple and effective. In the classification phase, given an unseen sentence, all its Fp,a are generated and classified by each individual classifier Ci. The argument associated with the maximum among the scores provided by the individual classifiers is eventually selected. null</Paragraph>
    </Section>
    <Section position="2" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
2.2 Standard feature space
</SectionTitle>
      <Paragraph position="0"> The discovery of relevant features is, as usual, a complex task. However, there is a common consensus on the set of basic features. These standard features, firstly proposed in (Gildea and Jurasfky, 2002), refer to unstructured information derived from parse trees, i.e. Phrase Type, Predicate Word, Head Word, Governing Category, Position and Voice. For example, the Phrase Type indicates the syntactic type of the phrase labeled as a predicate argument, e.g. NP for Arg1 in Figure 1. The Parse Tree Path contains the path in the parse tree between the predicate and the argument phrase, expressed as a sequence of nonterminal labels linked by direction (up or down) symbols, e.g. V|VP|NP for Arg1 in  the verbal predicate, e.g. rent for all arguments.</Paragraph>
      <Paragraph position="1"> In the next section we describe the SVM approach and the basic kernel theory for the predicate argument classification.</Paragraph>
      <Paragraph position="2"> 3 Learning predicate structures via Support Vector Machines Given a vector space in Rfracturn and a set of positive and negative points, SVMs classify vectors according to a separating hyperplane, H(vectorx) = vectorw x vectorx + b = 0, where vectorw [?] Rfracturn and b [?] Rfractur are learned by applying the Structural Risk Minimization principle (Vapnik, 1995).</Paragraph>
      <Paragraph position="3"> To apply the SVM algorithm to Predicate Argument Classification, we need a function ph : F -Rfracturn to map our features space F = {f1,..,f|F|} and our predicate/argument pair representation, Fp,a = Fz, into Rfracturn, such that:</Paragraph>
      <Paragraph position="5"> From the kernel theory we have that:</Paragraph>
      <Paragraph position="7"> aiph(Fi)*ph(Fz)+b.</Paragraph>
      <Paragraph position="8"> where, Fi [?]i [?] {1,..,l} are the training instances and the product K(Fi,Fz) =&lt;ph(Fi)*ph(Fz)&gt; is the kernel function associated with the mapping ph. The simplest mapping that we can apply is ph(Fz) = vectorz = (z1,...,zn) where zi = 1 if fi [?] Fz and zi = 0 otherwise, i.e. the characteristic vector of the set Fz with respect to F. If we choose the scalar product as a kernel function we obtain the linear kernel KL(Fx,Fz) = vectorx*vectorz.</Paragraph>
      <Paragraph position="9"> An interesting property is that we do not need to evaluate the ph function to compute the above vector. Only the K(vectorx,vectorz) values are in fact required. This allows us to derive efficient classifiers in a huge (possible infinite) feature space, provided that the kernel is processed in an efficient way. This property is also exploited to design convolution kernel like those based on tree structures.</Paragraph>
    </Section>
    <Section position="3" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
3.1 The tree kernel function
</SectionTitle>
      <Paragraph position="0"> The main idea of the tree kernels is the modeling of a KT(T1,T2) function which computes the number of common substructures between two trees T1 and T2.</Paragraph>
      <Paragraph position="1"> Given the set of substructures (fragments) {f1,f2,..} = F extracted from all the trees of the training set, we define the indicator function Ii(n)  which is equal 1 if the target fi is rooted at node n and 0 otherwise. It follows that:</Paragraph>
      <Paragraph position="3"> where NT1 and NT2 are the sets of the T1's and T2's nodes, respectively and [?](n1,n2) =summationtext |F| i=1 Ii(n1)Ii(n2). This latter is equal to the number of common fragments rooted at the n1 and n2 nodes. We can compute [?] as follows:  1. if the productions at n1 and n2 are different then [?](n1,n2) = 0; 2. if the productions at n1 and n2 are the same, and n1 and n2 have only leaf children (i.e. they are pre-terminals symbols) then [?](n1,n2) = 1; 3. if the productions at n1 and n2 are the same, and n1 and n2 are not pre-terminals then</Paragraph>
      <Paragraph position="5"> where nc(n1) is the number of the children of n1 and cjn is the j-th child of the node n. Note that, as the productions are the same, nc(n1) = nc(n2).</Paragraph>
      <Paragraph position="6"> The above kernel has the drawback of assigning higher weights to larger structures1. In order to overcome this problem we scale the relative importance of the tree fragments imposing a parameter l in conditions 2 and 3 as follows: [?](nx,nz) = l and [?](nx,nz) = lproducttextnc(nx)j=1 (1+[?](cjn1,cjn2)).</Paragraph>
      <Paragraph position="7"> 1In order to approach this problem and to map similarity scores in the [0,1] range, a normalization in the kernel space,</Paragraph>
      <Paragraph position="9"/>
    </Section>
  </Section>
  <Section position="4" start_page="50" end_page="52" type="metho">
    <SectionTitle>
4 Boundary detection via argument spanning
</SectionTitle>
    <Paragraph position="0"> spanning Section 2 has shown that traditional argument boundary classifiers rely only on features extracted from the current potential argument node. In order to take into account a complete argument structure information, the classifier should select a set of parse-tree nodes and consider them as potential arguments of the target predicate. The number of all possible subsets is exponential in the number of the parse-tree nodes of the sentence, thus, we need to cut the search space. For such purpose, a traditional boundary classifier can be applied to select the set of potential arguments PA. The reduced number of PAsubsets can be associated with sentence subtrees which in turn can be classified by using tree kernel functions. These measure if a subtree is compatible or not with the subtree of a correct predicate argument structure.</Paragraph>
    <Section position="1" start_page="50" end_page="52" type="sub_section">
      <SectionTitle>
4.1 The Predicate Argument Spanning Trees (PASTs)
</SectionTitle>
      <Paragraph position="0"> (PASTs) We consider the predicate argument structures annotated in PropBank along with the corresponding TreeBank data as our object space. Given the target predicate p in a sentence parse tree T and a subset s = {n1,..,nk} of the T's nodes, NT , we define as the spanning tree root r the lowest common ancestor of n1,..,nk. The node spanning tree (NST), ps is the subtree rooted in r, from which the nodes that are neither ancestors nor descendants of any ni are removed.</Paragraph>
      <Paragraph position="1"> Since predicate arguments are associated with tree nodes, we can define the predicate argu- null ment spanning tree (PAST) of a predicate argument set, {a1,..,an}, as the NST over such nodes, i.e. p{a1,..,an}. A PAST corresponds to the minimal subparse tree whose leaves are all and only the word sequence compounding the arguments. For example, Figure 2 shows the parse tree of the sentence &amp;quot;John took the book and read its title&amp;quot;. took{ARG0,ARG1} and read{ARG0,ARG1} are two PAST structures associated with the two predicates took and read, respectively. All the other NSTs are not valid PASTs.</Paragraph>
      <Paragraph position="2"> Notice that, labeling ps,[?]s [?] NT with a PAST classifier (pastc) corresponds to solve the boundary problem. The critical points for the application of this strategy are: (1) how to design suitable features for the PAST characterization. This new problem requires a careful linguistic investigation about the significant properties of the argument spanning trees and (2) how to deal with the exponential number of NSTs.</Paragraph>
      <Paragraph position="3"> For the first problem, the use of tree kernels over the PASTs can be an alternative to the manual features design as the learning machine, (e.g. SVMs) can select the most relevant features from a high dimensional feature space. In other words, we can use Eq. 1 to estimate the similarity between two PASTs avoiding to define explicit features. The same idea has been successfully applied to the parse-tree re-ranking task (Taskar et al., 2004; Collins and Duffy, 2002) and predicate argument classification (Moschitti, 2004).</Paragraph>
      <Paragraph position="4"> For the second problem, i.e. the high computational complexity, we can cut the search space by using a traditional boundary classifier (tbc), e.g. (Pradhan et al., 2004), which provides a small set of potential argument nodes. Let PA be the set of nodes located by tbc as arguments. We may consider the set P of the NSTs associated with any subset of PA, i.e. P = {ps : s [?] PA}. However, also the classification ofP may be computationally problematic since theoretically there are |P |= 2|PA| members.</Paragraph>
      <Paragraph position="5"> In order to have a very efficient procedure, we applied pastc to only the PA sets associated with incorrect PASTs. A way to detect such incorrect NSTs is to look for a node pair &lt;n1,n2&gt;[?] PA x PA of overlapping nodes, i.e. n1 is ancestor of n2 or viceversa. After we have detected such nodes, we create two node sets PA1 = PA[?]{n1} and PA2 = PA[?]{n2} and classify them with the pastc to select the correct set of argument boundaries. This procedure can be generalized to a set of overlapping nodes O greater than 2 as reported in Appendix 1.</Paragraph>
      <Paragraph position="6"> Note that the algorithm selects a maximal set of non-overlapping nodes, i.e. the first that is generated. Additionally, the worst case is rather rare thus the algorithm is very fast on average.</Paragraph>
      <Paragraph position="7"> The Figure 3 shows a working example of the multi-stage classifier. In Frame (a), tbc labels as potential arguments (gray color) three overlapping nodes (in Arg.1). The overlap resolution algorithm proposes two solutions (Frame (b)) of which only one is correct. In fact, according to the second solution the propositional phrase &amp;quot;of the book&amp;quot; would incorrectly be attached to the verbal predicate, i.e. in contrast with the parse tree. The pastc, applied  to the two NSTs, should detect this inconsistency and provide the correct output. Note that, during the learning, we generate the non-overlapping structures in the same way to derive the positive and negative examples.</Paragraph>
    </Section>
    <Section position="2" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
4.2 Engineering Tree Fragment Features
</SectionTitle>
      <Paragraph position="0"> In the Frame (b) of Figure 3, we show one of the possible cases which pastc should deal with. The critical problem is that the two NSTs are perfectly identical, thus, it is not possible to discern between them using only their parse-tree fragments.</Paragraph>
      <Paragraph position="1"> The solution to engineer novel features is to simply add the boundary information provided by the tbc to the NSTs. We mark with a progressive number the phrase type corresponding to an argument node, starting from the leftmost argument. For example, in the first NST of Frame (c), we mark as NP-0 and NP-1 the first and second argument nodes whereas in the second NST we have an hypothesis of three arguments on the NP, NP and PP nodes. We trasform them in NP-0, NP-1 and PP-2.</Paragraph>
      <Paragraph position="2"> This simple modification enables the tree kernel to generate features useful to distinguish between two identical parse trees associated with different argument structures. For example, for the first NST the fragments [NP-1 [NP][PP]], [NP [DT][NN]] and [PP [IN][NP]] are generated. They do not match anymore with the [NP-0 [NP][PP]], [NP-1 [DT][NN]] and [PP-2 [IN][NP]] fragments of the second NST.</Paragraph>
      <Paragraph position="3"> In order to verify the relevance of our model, the next section provides empirical evidence about the effectiveness of our approach.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="52" end_page="53" type="metho">
    <SectionTitle>
5 The Experiments
</SectionTitle>
    <Paragraph position="0"> The experiments were carried out with the SVM-light-TK software available at http://ai-nlp.info.uniroma2.it/moschitti/ which encodes the tree kernels in the SVM-light software (Joachims, 1999). For tbc, we used the linear kernel with a regularization parameter (option -c) equal to 1 and a cost-factor (option -j) of 10 to have a higher Recall. For the pastc we used l = 0.4 (see (Moschitti, 2004)).</Paragraph>
    <Paragraph position="1"> As referring dataset, we used the PropBank corpora available at www.cis.upenn.edu/[?]ace, along with the Penn TreeBank 2 (www.cis.upenn.edu/[?]treebank) (Marcus et al., 1993). This corpus contains about 53,700 sentences and a fixed split between training and testing which has been used in other researches, e.g. (Pradhan et al., 2004; Gildea and Palmer, 2002).</Paragraph>
    <Paragraph position="2"> We did not include continuation and co-referring arguments in our experiments.</Paragraph>
    <Paragraph position="3"> We used sections from 02 to 07 (54,443 argument nodes and 1,343,046 non-argument nodes) to train the traditional boundary classifier (tbc). Then, we applied it to classify the sections from 08 to 21 (125,443 argument nodes vs. 3,010,673 non-argument nodes). As results we obtained 2,988 NSTs containing at least an overlapping node pair out of the total 65,212 predicate structures (according to the tbc decisions). From the 2,988 overlapping structures we extracted 3,624 positive and 4,461 negative NSTs, that we used to train the pastc.</Paragraph>
    <Paragraph position="4"> The performance was evaluated with the F1 measure2 over the section 23. This contains 10,406 argument nodes out of 249,879 parse tree nodes. By applying the tbc classifier we derived 235 overlapping NSTs, from which we extracted 204 PASTs and 385 incorrect predicate argument structures. On such test data, the performance of pastc was very high, i.e. 87.08% in Precision and 89.22% in Recall. Using the pastc we removed from the tbc the PA that cause overlaps. To measure the impact on the boundary identification performance, we compared it with three different boundary classification baselines: null * tbc: overlaps are ignored and no decision is taken. This provides an upper bound for the recall as no potential argument is rejected for later labeling. Notice that, in presence of overlapping nodes, the sentence cannot be annotated correctly.</Paragraph>
    <Paragraph position="5"> * RND: one among the non-overlapping structures with maximal number of arguments is randomly selected.</Paragraph>
    <Paragraph position="6"> 2F1 assigns equal importance to Precision P and Recall R, i.e. F1 = 2PxRP+R .</Paragraph>
    <Paragraph position="7">  non-overlapping structures (RND), the heuristic to select the most suitable non-overlapping node set (Heu) and the predicate argument spanning tree classifier (pastc).</Paragraph>
    <Paragraph position="8"> * Heu (heuristic): one of the NSTs which contain the nodes with the lowest overlapping score is chosen. This score counts the number of overlapping node pairs in the NST. For example, in Figure 3.(a) we have a NP that overlaps with two nodes NP and PP, thus it is assigned a score of 2.</Paragraph>
    <Paragraph position="9"> The third row of Table 1 shows the results of tbc, tbc + RND, tbc + Heu and tbc + pastc in the columns 2,3,4 and 5, respectively. We note that: * The tbc F1 is slightly higher than the result obtained in (Pradhan et al., 2004), i.e. 95.37% vs. 93.8% on same training/testing conditions, i.e. (same PropBank version, same training and testing split and same machine learning algorithm). This is explained by the fact that we did not include the continuations and the co-referring arguments that are more difficult to detect.</Paragraph>
    <Paragraph position="10"> * Both RND and Heu do not improve the tbc result. This can be explained by observing that in the 50% of the cases a correct node is removed.</Paragraph>
    <Paragraph position="11"> * When, to select the correct node, the pastc is used, the F1 increases of 1.49%, i.e. (96.86 vs.</Paragraph>
    <Paragraph position="12"> 95.37). This is a very good result considering that to increase the very high baseline of tbc is hard.</Paragraph>
    <Paragraph position="13"> In order to give a fairer evaluation of our approach we tested the above classifiers on the overlapping structures only, i.e. we measured the pastc improvement on all and only the structures that required its application. Such reduced test set contains 642 argument nodes and 15,408 non-argument nodes. The fourth row of Table 1 reports the classifier performance on such task. We note that the pastc improves the other heuristics of about 20%.</Paragraph>
  </Section>
class="xml-element"></Paper>