<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1073"> <Title>Joint Learning Improves Semantic Role Labeling</Title> <Section position="4" start_page="590" end_page="591" type="metho"> <SectionTitle>Table 1 features: FIRST/LAST WORD; LEFT/RIGHT SISTER PHRASE-TYPE; LEFT/RIGHT SISTER HEAD WORD/POS; PARENT PHRASE-TYPE; PARENT POS/HEAD-WORD; ORDINAL TREE DISTANCE: Phrase Type with appended length of PATH feature</SectionTitle> <Paragraph position="0"> Identification and classification models can be chained in a principled way, as in Equation 1. The features we used for the local identification and classification models are outlined in Table 1. These features are a subset of the features used in previous work. The standard features at the top of the table were defined by Gildea and Jurafsky (2002), and the rest are other useful lexical and structural features identified in more recent work (Pradhan et al., 2004; Surdeanu et al., 2003; Xue and Palmer, 2004). The most direct way to use trained local identification and classification models in testing is to select a labeling L of the parse tree that maximizes the product of the probabilities according to the two models, as in Equation 1. Since these models are local, this is equivalent to independently maximizing the product of the probabilities of the two models for the label li of each parse tree node ni, as shown in Equation 2.</Paragraph> <Paragraph position="2"> A problem with this approach is that a maximizing labeling of the nodes could violate the constraint that argument nodes should not overlap with each other. 
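To make the failure mode concrete, here is a minimal Python sketch of this independent per-node decision; the `node_probs` dictionary is a hypothetical stand-in for the probabilities produced by the local identification and classification models, and nothing in the procedure stops a node and its descendant from both being labeled ARG:

```python
# Per-node local labeling: each node's label is chosen independently,
# so the non-overlap constraint can be violated.
# `node_probs` is a hypothetical stand-in for the local models' output.

def label_locally(node_probs):
    """Pick the most probable label for every node independently."""
    return {node: max(probs, key=probs.get)  # argmax over labels
            for node, probs in node_probs.items()}

# NP_1.1 is a descendant of NP_1, yet both receive ARG:
node_probs = {
    "NP_1":   {"ARG": 0.6, "NONE": 0.4},
    "NP_1.1": {"ARG": 0.7, "NONE": 0.3},
}
labels = label_locally(node_probs)  # both nodes labeled ARG
```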
Therefore, to produce a consistent set of arguments with local classifiers, we must have a way of enforcing the non-overlapping constraint.</Paragraph> <Section position="1" start_page="590" end_page="591" type="sub_section"> <SectionTitle> 3.1 Enforcing the Non-overlapping Constraint </SectionTitle> <Paragraph position="0"> Here we describe a fast exact dynamic programming algorithm to find the most likely non-overlapping (consistent) labeling of all nodes in the parse tree, according to a product of probabilities from local models, as in Equation 2. For simplicity, we describe the dynamic program for the case where only two classes are possible: ARG and NONE. The generalization to more classes is straightforward. Intuitively, the algorithm is similar to the Viterbi algorithm for context-free grammars, because we can describe the non-overlapping constraint by a &quot;grammar&quot; that disallows ARG nodes from having ARG descendants. Below, we talk about maximizing the sum of the logarithms of the local probabilities rather than their product, which is equivalent. The dynamic program works from the leaves of the tree upward and finds a best assignment for each subtree, using the already computed assignments for its children. Suppose we want the most likely consistent assignment for a subtree t with child subtrees t1, ..., tk, each storing the most likely consistent assignment of the nodes it dominates, as well as the log-probability of assigning all of the nodes it dominates to NONE. The most likely assignment for t is the one that corresponds to the maximum of: * The sum of the log-probabilities of the most likely assignments of the children subtrees t1,...
,tk, plus the log-probability of assigning the node t to NONE * The sum of the log-probabilities of assigning all nodes of each child subtree ti to NONE, plus the log-probability of assigning the node t to ARG.</Paragraph> <Paragraph position="1"> Propagating this procedure from the leaves to the root of t, we arrive at the most likely non-overlapping assignment. By slightly modifying this procedure, we obtain the most likely assignment according to a product of local identification and classification models. We use the local models in conjunction with this search procedure to select a most likely labeling in testing. Test set results for our local model P lscriptSRL are given in Table 2.</Paragraph> </Section> </Section> <Section position="5" start_page="591" end_page="593" type="metho"> <SectionTitle> 4 Joint Classifiers </SectionTitle> <Paragraph position="0"> As discussed in previous work, there are strong dependencies among the labels of the semantic argument nodes of a verb. A drawback of local models is that, when they decide the label of a parse tree node, they cannot use information about the labels and features of other nodes in the tree.</Paragraph> <Paragraph position="1"> Furthermore, these dependencies are highly nonlocal. For instance, to avoid repeating argument labels in a frame, we need to add a dependency from each node label to the labels of all other nodes.</Paragraph> <Paragraph position="2"> A factorized sequence model that assumes a finite Markov horizon, such as a chain Conditional Random Field (Lafferty et al., 2001), would not be able to encode such dependencies.</Paragraph> <Paragraph position="3"> The Need for Re-ranking: For argument identification, the number of possible assignments for a parse tree with n nodes is 2^n. This number can run into the hundreds of billions for a normal-sized tree. For argument labeling, the number of possible assignments is approximately
20^m, where m is the number of arguments of a verb (typically between 2 and 5) and 20 is the approximate number of possible labels when considering both core and modifying arguments. Training a model with such a huge number of classes is infeasible unless the model factorizes due to strong independence assumptions. Therefore, in order to incorporate long-range dependencies in our models, we chose to adopt a re-ranking approach (Collins, 2000), which selects among likely assignments generated by a model that makes stronger independence assumptions. We utilize the top N assignments of our local semantic role labeling model P lscriptSRL to generate likely assignments. As can be seen from Table 3, for relatively small values of N, our re-ranking approach does not present a serious bottleneck to performance. We used a value of N = 20 for training. In Table 3 we can see that if we could pick, using an oracle, the best assignment out of the top 20 assignments according to the local model, we would achieve an F-Measure of 98.8 on all arguments. Increasing N to 30 results in a very small gain in the upper bound on performance and a large increase in memory requirements. We therefore selected N = 20 as a good compromise.</Paragraph> <Paragraph position="4"> Generation of the top N most likely joint assignments: We generate the top N most likely non-overlapping joint assignments of labels to nodes in a parse tree according to a local model P lscriptSRL by an exact dynamic programming algorithm, which is a generalization of the algorithm for finding the top non-overlapping assignment described in Section 3.1.</Paragraph> <Section position="1" start_page="591" end_page="591" type="sub_section"> <SectionTitle> Parametric Models </SectionTitle> <Paragraph position="0"> We learn log-linear re-ranking models for joint semantic role labeling, which use feature maps from a parse tree and label sequence to a vector space. The form of the models is as follows. 
Let Phi(t,v,L) ∈ R^s denote a feature map from a tree t, target verb v, and joint assignment L of the nodes of the tree to the vector space R^s. Let L1, L2, ..., LN denote the top N possible joint assignments. We learn a log-linear model with a parameter vector W, with one weight for each of the s dimensions of the feature vector. The probability (or score) of an assignment L according to this re-ranking model is defined as:</Paragraph> <Paragraph position="2"> The score of an assignment L not in the top N is zero. We train the model to maximize the sum of log-likelihoods of the best assignments minus a quadratic regularization term.</Paragraph> <Paragraph position="3"> In this framework, we can define arbitrary features of labeled trees that capture general properties of predicate-argument structure.</Paragraph> </Section> <Section position="2" start_page="591" end_page="592" type="sub_section"> <SectionTitle> Joint Model Features </SectionTitle> <Paragraph position="0"> We will introduce the features of the joint re-ranking model in the context of the example parse tree shown in Figure 1. We model dependencies not only between the label of a node and the labels of other nodes, but also between the label of a node and input features of other argument nodes. The features are specified by instantiating templates, and the value of a feature is the number of times a particular pattern occurs in the labeled tree.</Paragraph> </Section> <Section position="3" start_page="592" end_page="593" type="sub_section"> <SectionTitle> Templates </SectionTitle> <Paragraph position="0"> For a tree t, predicate v, and joint assignment L of labels to the nodes of the tree, we define the candidate argument sequence as the left-to-right sequence of non-NONE labeled nodes [n1-l1, ..., v-PRED, ..., nm-lm] (where li is the label of node ni). A reasonable candidate argument sequence usually contains very few of the nodes in the tree - about 2 to 7 nodes, as this is the typical number of arguments for a verb. 
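As a small illustrative sketch (the (node, label) pairs below are hypothetical, and we assume the predicate node carries a PRED label), extracting the candidate argument sequence amounts to keeping the non-NONE nodes in left-to-right order:

```python
def candidate_argument_sequence(labeled_nodes):
    """Keep only non-NONE nodes (the argument nodes and the predicate),
    preserving the left-to-right constituent order."""
    return [(node, label) for node, label in labeled_nodes if label != "NONE"]

# Hypothetical left-to-right labeled constituents for one verb:
nodes = [("NP1", "ARG1"), ("DT1", "NONE"), ("VBD1", "PRED"),
         ("PP1", "ARG4"), ("NP2", "NONE"), ("NP3", "ARGM-TMP")]
seq = candidate_argument_sequence(nodes)  # keeps NP1, VBD1, PP1, and NP3
```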
To make it more convenient to express our feature templates, we include the predicate node v in the sequence.</Paragraph> <Paragraph position="1"> This sequence of labeled nodes is defined with respect to the left-to-right order of constituents in the parse tree. Since non-NONE labeled nodes do not overlap, there is a strict left-to-right order among these nodes. The candidate argument sequence that corresponds to the correct assignment in Figure 1 is: [NP1-ARG1,VBD1-PRED,PP1-ARG4,NP3-ARGM-TMP] Features from Local Models: All features included in the local models are also included in our joint models. In particular, each template for local features is included as a joint template that concatenates the local template and the node label. For example, for the local feature PATH, we define a joint feature template that extracts PATH from every node in the candidate argument sequence and concatenates it with the label of the node. Both a feature with the specific argument label and a feature with the generic back-off ARG label are created. This is similar to adding features from identification and classification models. In the case of the example candidate argument sequence above, for the node NP1 we have the features: (NP|S|)-ARG1, (NP|S|)-ARG When comparing a local and a joint model, we use the same set of local feature templates in the two models.</Paragraph> <Paragraph position="2"> Whole Label Sequence: As observed in previous work (Gildea and Jurafsky, 2002; Pradhan et al., 2004), including information about the set or sequence of labels assigned to argument nodes should be very helpful for disambiguation. For example, such information will make the model less likely to pick multiple fillers for the same role or to produce a labeling that lacks an obligatory argument. We added a whole label sequence feature template that extracts the labels of all argument nodes and preserves information about the position of the predicate. 
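A minimal sketch of such a template, assuming the candidate argument sequence is a list of hypothetical (node, label) pairs and that the predicate's voice is given as a string; the back-off variant simply replaces each argument label with the generic ARG (modifier handling is simplified here):

```python
def whole_label_sequence_feature(candidate_seq, voice, use_specific=True):
    """Concatenate the labels of the candidate argument sequence,
    prefixed with the predicate's voice; with use_specific=False,
    each argument label is replaced by the generic back-off ARG."""
    labels = [label if (label == "PRED" or use_specific) else "ARG"
              for _node, label in candidate_seq]
    return "[ voice:%s %s]" % (voice, ",".join(labels))

seq = [("NP1", "ARG1"), ("VBD1", "PRED"), ("PP1", "ARG4"), ("NP3", "ARGM-TMP")]
feature = whole_label_sequence_feature(seq, "active")
# -> "[ voice:active ARG1,PRED,ARG4,ARGM-TMP]"
```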
The template also includes information about the voice of the predicate. For example, this template will be instantiated as follows for the example candidate argument sequence: [ voice:active ARG1,PRED,ARG4,ARGM-TMP] We also add a variant of this feature that uses a generic ARG label instead of specific labels. This feature template has the effect of counting the number of arguments to the left and right of the predicate, which provides useful global information about argument structure. As previously observed (Pradhan et al., 2004), including modifying arguments in sequence features is not helpful. This was confirmed in our experiments, and we redefined the whole label sequence features to exclude modifying arguments.</Paragraph> <Paragraph position="3"> One important variation of this feature uses the actual predicate lemma in addition to &quot;voice:active&quot;. Additionally, we define variations of these feature templates that concatenate the label sequence with features of individual nodes. We experimented with variations and found that including the phrase type and the head of a directly dominating PP (if one exists) was most helpful. We also add a feature that detects repetitions of the same label in a candidate argument sequence, together with the phrase types of the nodes labeled with that label. For example, (NP-ARG0,WHNP-ARG0) is a common pattern of this form.</Paragraph> <Paragraph position="4"> Frame Features: Another very effective class of features we defined looks at the label of a single argument node together with internal features of other argument nodes. The idea of these features is to capture knowledge about the label of a constituent given the syntactic realization of all arguments of the verb. This is helpful for capturing syntactic alternations, such as the dative alternation. For example, consider the sentence (i) &quot;[Shaw Publishing]ARG0 offered [Mr. 
Smith]ARG2 [a reimbursement]ARG1 &quot; and the alternative realization (ii) &quot;[Shaw Publishing]ARG0 offered [a reimbursement]ARG1 [to Mr. Smith]ARG2&quot;. When classifying the NP in object position, it is useful to know whether the following argument is a PP. If so, the NP is more likely to be an ARG1; if not, it is more likely to be an ARG2. A feature template that captures such information extracts, for each argument node, its phrase type and label in the context of the phrase types of all other arguments. For example, the instantiation of such a template for [a reimbursement] in (ii) would be [ voice:active NP,PRED,NP-ARG1,PP] We also add a template that concatenates the identity of the predicate lemma itself.</Paragraph> <Paragraph position="5"> We should note that Xue and Palmer (2004) define a similar feature template, called syntactic frame, which often captures similar information. The important difference is that their template extracts contextual information from noun phrases surrounding the predicate, rather than from the sequence of argument nodes. Because our model is joint, we are able to use information about other argument nodes when labeling a node.</Paragraph> </Section> <Section position="4" start_page="593" end_page="593" type="sub_section"> <SectionTitle> Final Pipeline </SectionTitle> <Paragraph position="0"> Here we describe the application in testing of a joint model for semantic role labeling, using a local model P lscriptSRL and a joint re-ranking model P rSRL.</Paragraph> <Paragraph position="1"> P lscriptSRL is used to generate the top N non-overlapping joint assignments L1, ..., LN.</Paragraph> <Paragraph position="2"> One option is to select the best Li according to P rSRL, as in Equation 3, ignoring the score from the local model. In our experiments, we noticed that for larger values of N, the performance of our re-ranking model P rSRL decreased. 
This was probably due to the fact that at test time the local classifier produces very poor argument frames near the bottom of the top N for large N. Since the re-ranking model is trained on relatively few good argument frames, it cannot easily rule out very bad frames. It therefore makes sense to incorporate the local model into our final score. Our final score is given by:</Paragraph> <Paragraph position="4"> where alpha is a tunable parameter determining how much influence the local score has on the final score. Such interpolation with a score from a first-pass model was also used for parse re-ranking by Collins (2000).</Paragraph> <Paragraph position="5"> Given this score, at test time we choose among the top N local assignments L1, ..., LN according to: arg max over L ∈ {L1,...,LN} of alpha log P lscriptSRL(L|t,v) + log P rSRL(L|t,v)</Paragraph> </Section> </Section> </Paper>