<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1023"> <Title>Data-Defined Kernels for Parse Reranking Derived from Probabilistic Models</Title> <Section position="3" start_page="181" end_page="182" type="metho"> <SectionTitle> 2 Kernels Derived from Probabilistic </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="181" end_page="181" type="sub_section"> <SectionTitle> Models </SectionTitle> <Paragraph position="0"> In recent years, several methods have been proposed for constructing kernels from trained probabilistic models. As usual, these kernels are then used with linear classifiers to learn the desired task. As well as some empirical successes, these methods are motivated by theoretical results which suggest we should expect some improvement with these classifiers over the classifier which chooses the most probable answer according to the probabilistic model (i.e. the maximum a posteriori (MAP) classifier). There is guaranteed to be a linear classifier for the derived kernel which performs at least as well as the MAP classifier for the probabilistic model. So, assuming a large-margin classifier can optimize a more appropriate criteria than the posterior probability, we should expect the derived kernel's classifier to perform better than the probabilistic model's classifier, although empirical results on a given task are never guaranteed.</Paragraph> <Paragraph position="1"> In this section, we first present two previous kernels and then propose a new kernel specifically for reranking tasks. In each of these discussions we need to characterize the parsing problem as a classification task. Parsing can be regarded as a mapping from an input space of sentences x[?]X to a structured output space of parse trees y[?]Y. On the basis of training sentences, we learn a discriminant function F : X x Y - R. The parse tree y with the largest value for this discriminant function F(x,y) is the output parse tree for the sentence x. We focus on the linear discriminant functions: Fw(x,y) = <w,ph(x,y)>, where ph(x,y) is a feature vector for the sentence-tree pair, w is a parameter vector for the discriminant function, and <a,b> is the inner product of vectors a and b. In the remainder of this section, we will characterize the kernel methods we consider in terms of the feature extractor ph(x,y).</Paragraph> </Section> <Section position="2" start_page="181" end_page="181" type="sub_section"> <SectionTitle> 2.1 Fisher Kernels </SectionTitle> <Paragraph position="0"> The Fisher kernel (Jaakkola and Haussler, 1998) is one of the best known kernels belonging to the class of probability model based kernels. Given a generative model of P(z|^th) with smooth parameterization, the Fisher score of an example z is a vector of partial derivatives of the log-likelihood of the example with respect to the model parameters:</Paragraph> <Paragraph position="2"> This score can be regarded as specifying how the model should be changed in order to maximize the likelihood of the example z. Then we can define the similarity between data points as the inner product of the corresponding Fisher scores. This kernel is often referred to as the practical Fisher kernel. The theoretical Fisher kernel depends on the Fisher information matrix, which is not feasible to compute for most practical tasks and is usually omitted.</Paragraph> <Paragraph position="3"> The Fisher kernel is only directly applicable to binary classification tasks. 
<Paragraph position="3"> The Fisher kernel is only directly applicable to binary classification tasks. We can apply it to our task by considering an example z to be a sentence-tree pair (x,y), and classifying the pairs into correct parses versus incorrect parses. When we use the Fisher score φ_θ̂(x,y) in the discriminant function F, we can interpret the value as the confidence that the tree y is correct, and choose the y in which we are the most confident.</Paragraph> </Section> <Section position="3" start_page="181" end_page="182" type="sub_section"> <SectionTitle> 2.2 TOP Kernels </SectionTitle> <Paragraph position="0"> Tsuda et al. (2002) proposed another kernel constructed from a probabilistic model, called the Tangent vectors Of Posterior log-odds (TOP) kernel. Their TOP kernel is also only for binary classification tasks, so, as above, we treat the input z as a sentence-tree pair and the output category c ∈ {−1,+1} as incorrect/correct. It is assumed that the true probability distribution is included in the class of probabilistic models and that the true parameter vector θ* is unique. The feature extractor of the TOP kernel for the input z is defined by: φ_θ̂(z) = ( v(z,θ̂), ∂v(z,θ̂)/∂θ1 , ..., ∂v(z,θ̂)/∂θl ), where v(z,θ̂) = log P(c=+1|z,θ̂) − log P(c=−1|z,θ̂).</Paragraph> <Paragraph position="1"> In addition to being at least as good as the MAP classifier, the choice of the TOP kernel feature extractor is motivated by the minimization of the binary classification error of a linear classifier ⟨w, φ_θ̂(z)⟩ + b. Tsuda et al. (2002) demonstrate that this error is closely related to the estimation error of the posterior probability P(c=+1|z,θ*) by the estimator g(⟨w, φ_θ̂(z)⟩ + b), where g is the sigmoid function g(t) = 1/(1 + exp(−t)).</Paragraph> <Paragraph position="2"> The TOP kernel is not quite appropriate for structured classification tasks because φ_θ̂(z) is motivated by binary classification error minimization. In the next subsection, we adapt it to structured classification.</Paragraph> </Section> <Section position="4" start_page="182" end_page="182" type="sub_section"> <SectionTitle> 2.3 A TOP Kernel for Reranking </SectionTitle> <Paragraph position="0"> We define the reranking task as selecting a parse tree from the list of candidate trees suggested by a probabilistic model. Furthermore, we only consider learning to rerank the output of a particular probabilistic model, without requiring the classifier to have good performance when applied to a candidate list provided by a different model. In this case, it is natural to model the probability that a parse tree is the best candidate given the list of candidate trees:</Paragraph> <Paragraph position="1"> P(yk|x, y1,...,ys) = P(x,yk) / Σ_t P(x,yt),</Paragraph> <Paragraph position="2"> where y1,...,ys is the list of candidate parse trees.</Paragraph> <Paragraph position="3"> To construct a new TOP kernel for reranking, we apply an approach similar to that used for the TOP kernel (Tsuda et al., 2002), but we consider the probability P(yk|x,y1,...,ys,θ*) instead of the probability P(c=+1|z,θ*) considered by Tsuda et al. The resulting feature extractor is given by: φ_θ̂(x,yk) = ( v(x,yk,θ̂), ∂v(x,yk,θ̂)/∂θ1 , ..., ∂v(x,yk,θ̂)/∂θl ), where v(x,yk,θ̂) = log P(yk|x,y1,...,ys,θ̂) − log Σ_{t≠k} P(yt|x,y1,...,ys,θ̂). We will call this kernel the TOP reranking kernel.</Paragraph> </Section> </Section>
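As a rough illustration of this feature extractor, the following sketch computes v and its gradient for a stand-in log-linear model over a candidate list; the candidate scoring model and the finite-difference gradient are assumptions made purely for illustration, not the SSN parser described in the next section.

import numpy as np

# Schematic TOP reranking kernel features for one candidate in a list.
# The candidate probabilities come from a stand-in log-linear model
# score_k = theta . f_k (an assumption, not the paper's parser), and the
# gradient of v is taken by finite differences purely for illustration.

def candidate_log_probs(theta, feats):
    # log P(y_k | x, y_1..y_s): scores renormalized over the candidate list
    scores = feats @ theta
    return scores - np.log(np.sum(np.exp(scores)))

def v(theta, feats, k):
    # v = log P(y_k | ...) - log sum over t != k of P(y_t | ...)
    logp = candidate_log_probs(theta, feats)
    others = np.delete(logp, k)
    return logp[k] - np.log(np.sum(np.exp(others)))

def top_rerank_features(theta, feats, k, eps=1e-6):
    # phi(x, y_k) = (v, dv/dtheta_1, ..., dv/dtheta_l)
    base = v(theta, feats, k)
    grads = []
    for i in range(len(theta)):
        bumped = theta.copy()
        bumped[i] += eps
        grads.append((v(bumped, feats, k) - base) / eps)
    return np.concatenate(([base], grads))

def top_rerank_kernel(theta, feats, k1, k2):
    # Kernel value: inner product of the two candidates' feature vectors.
    return float(np.dot(top_rerank_features(theta, feats, k1),
                        top_rerank_features(theta, feats, k2)))

theta_hat = np.array([0.3, -0.2])
feats = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, 0.1]])   # one row per candidate
print(top_rerank_kernel(theta_hat, feats, 0, 1))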
<Section position="4" start_page="182" end_page="183" type="metho"> <SectionTitle> 3 The Probabilistic Model </SectionTitle> <Paragraph position="0"> To complete the definition of the kernel, we need to choose a probabilistic model of parsing. For this we use a statistical parser which has previously been shown to achieve state-of-the-art performance, namely that proposed in (Henderson, 2003). This parser has two levels of parameterization. The first level is in terms of a history-based generative probability model, but this level is not appropriate for our purposes because it defines an infinite number of parameters (one for every possible partial parse history). When parsing a given sentence, the bounded set of parameters which are relevant to that parse is estimated using a neural network. The weights of this neural network form the second level of parameterization, and there is a finite number of these parameters. Neural network training is applied to determine the values of these weights, which in turn determine the values of the probability model's parameters, which in turn determine the probabilistic model of parse trees.</Paragraph> <Paragraph position="1"> We do not use the complete set of neural network weights to define our kernels, but instead define a third level of parameterization which only includes the network's output layer weights. These weights define a normalized exponential model, with the network's hidden layer as the input features. When we tried using the complete set of weights in some small scale experiments, training the classifier was more computationally expensive, and the resulting classifier actually performed slightly worse than one using just the output weights.</Paragraph> <Paragraph position="2"> Using just the output weights also allows us to make some approximations in the TOP reranking kernel which make the classifier learning algorithm more efficient.</Paragraph> <Section position="1" start_page="182" end_page="183" type="sub_section"> <SectionTitle> 3.1 A History-Based Probability Model </SectionTitle> <Paragraph position="0"> As with many other statistical parsers (Ratnaparkhi, 1999; Collins, 1999; Charniak, 2000), Henderson (2003) uses a history-based model of parsing. He defines the mapping from phrase structure trees to parse sequences using a form of left-corner parsing strategy (see (Henderson, 2003) for more details).</Paragraph> <Paragraph position="1"> The parser actions include: introducing a new constituent with a specified label, attaching one constituent to another, and predicting the next word of the sentence. A complete parse consists of a sequence of these actions, d1,...,dm, such that performing d1,...,dm results in a complete phrase structure tree.</Paragraph> <Paragraph position="2"> Because this mapping to parse sequences is one-to-one, and the word prediction actions in a complete parse d1,...,dm specify the sentence, P(d1,...,dm) is equivalent to the joint probability of the output phrase structure tree and the input sentence. This probability can then be decomposed into the product of the probabilities of each action decision di conditioned on that decision's prior parse history d1,...,di−1:</Paragraph> <Paragraph position="3"> P(d1,...,dm) = Π_i P(di|d1,...,di−1).</Paragraph> </Section>
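The chain-rule decomposition above can be written down directly; the following tiny sketch does so with a hypothetical stub for the conditional P(di|d1,...,di−1), which in the actual model is estimated by the neural network described next.

import math

# Illustration of P(d1,...,dm) = prod_i P(di | d1,...,di-1).
# decision_prob is a hypothetical stand-in: the paper estimates these
# conditionals with a neural network, not a fixed rule like this one.

def decision_prob(decision, history):
    # Stub conditional distribution: uniform over three imaginary actions.
    return 1.0 / 3.0

def derivation_log_prob(decisions):
    # Sum of log P(di | d1,...,di-1) over the derivation.
    logp = 0.0
    for i, d in enumerate(decisions):
        logp += math.log(decision_prob(d, decisions[:i]))
    return logp

# A toy derivation: introduce a constituent, predict a word, attach.
print(derivation_log_prob(["introduce-NP", "predict-the", "attach"]))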
<Section position="2" start_page="183" end_page="183" type="sub_section"> <SectionTitle> 3.2 Estimating Decision Probabilities with a Neural Network </SectionTitle> <Paragraph position="0"> The parameters of the above probability model are the P(di|d1,...,di−1). There are an infinite number of these parameters, since the parse history d1,...,di−1 grows with the length of the sentence. In other work on history-based parsing, independence assumptions are applied so that only a finite amount of information from the parse history can be treated as relevant to each parameter, thereby reducing the number of parameters to a finite set which can be estimated directly. Instead, Henderson (2003) uses a neural network to induce a finite representation of this unbounded history, which we will denote h(d1,...,di−1). Neural network training tries to find a history representation which preserves all the information about the history that is relevant to estimating the desired probability.</Paragraph> <Paragraph position="1"> Using a neural network architecture called Simple Synchrony Networks (SSNs), the history representation h(d1,...,di−1) is incrementally computed from features of the previous decision di−1 plus a finite set of previous history representations h(d1,...,dj), j < i − 1. Each history representation is a finite vector of real numbers, called the network's hidden layer. As long as the history representation for position i − 1 is always included in the inputs to the history representation for position i, any information about the entire sequence could be passed from history representation to history representation and be used to estimate the desired probability. However, learning is biased towards paying more attention to information which passes through fewer history representations.</Paragraph> <Paragraph position="2"> To exploit this learning bias, structural locality is used to determine which history representations are input to which others. First, each history representation is assigned to the constituent which is on top of the parser's stack when it is computed. Then earlier history representations whose constituents are structurally local to the current representation's constituent are input to the computation of the current representation. In this way, the number of representations which information needs to pass through in order to flow from history representation i to history representation j is determined by the structural distance between i's constituent and j's constituent, and not just the distance between i and j in the parse sequence. This provides the neural network with a linguistically appropriate inductive bias when it learns the history representations, as explained in more detail in (Henderson, 2003).</Paragraph> <Paragraph position="3"> Once it has computed h(d1,...,di−1), the SSN uses a normalized exponential to estimate a probability distribution over the set of possible next decisions di given the history:</Paragraph> <Paragraph position="4"> P(di = t | d1,...,di−1) = exp( θt · h(d1,...,di−1) ) / Σ_{t' ∈ N(di−1)} exp( θt' · h(d1,...,di−1) ),</Paragraph> <Paragraph position="5"> where θt denotes the set of output layer weights corresponding to the parser action t, N(di−1) denotes the set of possible next parser actions after the step di−1, and θ denotes the full set of model parameters.</Paragraph>
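A small sketch of this output computation is given below; the hidden-layer vector and the per-action weight vectors are made-up stand-ins for the SSN's learned history representation and output weights.

import numpy as np

# Normalized exponential over the allowed next parser actions:
# P(di = t | d1..di-1) is proportional to exp(theta_t . h(d1..di-1)),
# normalized over t' in N(di-1).  Values below are illustrative only.

def next_decision_distribution(h, output_weights, allowed_actions):
    logits = np.array([output_weights[t] @ h for t in allowed_actions])
    logits -= logits.max()                      # for numerical stability
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return dict(zip(allowed_actions, probs))

hidden = np.array([0.1, -0.4, 0.8])             # h(d1,...,di-1): the hidden layer
weights = {                                     # theta_t for each action t (made up)
    "attach": np.array([0.5, 0.1, -0.2]),
    "introduce-S": np.array([-0.3, 0.2, 0.4]),
    "predict-word": np.array([0.0, 0.6, 0.1]),
}
print(next_decision_distribution(hidden, weights, list(weights)))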
<Paragraph position="6"> We trained SSN parsing models using the on-line version of backpropagation to perform gradient descent with a maximum likelihood objective function. This learning simultaneously tries to optimize the parameters of the output computation and the parameters of the mappings h(d1,...,di−1). With multi-layered networks such as SSNs, this training is not guaranteed to converge to a global optimum, but in practice a network whose criterion value is close to the optimum can be found.</Paragraph> </Section> </Section> <Section position="5" start_page="183" end_page="184" type="metho"> <SectionTitle> 4 Large-Margin Optimization </SectionTitle> <Paragraph position="0"> Once we have defined a kernel over parse trees, general techniques for linear classifier optimization can be used to learn the given task. The most sophisticated of these techniques (such as Support Vector Machines) are unfortunately too computationally expensive to be used on large datasets like the Penn Treebank (Marcus et al., 1993). Instead we use a method which has often been shown to be virtually as good, the Voted Perceptron (VP) algorithm (Freund and Schapire, 1998). The VP algorithm was originally applied to parse reranking in (Collins and Duffy, 2002) with the Tree kernel. We modify the perceptron training algorithm to make it more suitable for parsing, where zero-one classification loss is not the evaluation measure usually employed. We also develop a variant of the kernel defined in section 2.3, which is more efficient when used with the VP algorithm.</Paragraph> <Paragraph position="1"> Given a list of candidate trees, we train the classifier to select the tree with the largest constituent F1 score. The F1 score is a measure of the similarity between the tree in question and the gold standard parse, and is the standard way to evaluate the accuracy of a parser. We denote the kth candidate tree for the jth sentence xj by yjk. Without loss of generality, let us assume that yj1 is the candidate tree with the largest F1 score.</Paragraph> <Paragraph position="2"> The Voted Perceptron algorithm is an ensemble method for combining the various intermediate models produced during the training of a perceptron. It demonstrates more stable generalization performance than the normal perceptron algorithm when the problem is not linearly separable (Freund and Schapire, 1998), as is usually the case.</Paragraph> <Paragraph position="3"> We modify the perceptron algorithm by introducing a new classification loss function. This modification enables us to treat differently the cases where the perceptron predicts a tree with an F1 score much smaller than that of the top candidate and the cases where the predicted and top candidates have similar score values. The natural choice for the loss function is Δ(yjk, yj1) = F1(yj1) − F1(yjk), where F1(yjk) denotes the F1 score value for the parse tree yjk. This approach is very similar to slack variable rescaling for Support Vector Machines proposed in (Tsochantaridis et al., 2004). The learning algorithm we employed is presented in figure 1.</Paragraph>
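Since figure 1 itself is not reproduced here, the following sketch shows one plausible form of a loss-rescaled perceptron update, without the voting over intermediate models; the primal weight-vector formulation and the toy data are simplifying assumptions (the paper's algorithm works with kernels), shown only to illustrate how Δ(yjk, yj1) scales the update.

import numpy as np

# Sketch of a perceptron update rescaled by the F1 loss, in the spirit of
# section 4.  Kernelization, the real candidate data, and the voting over
# intermediate models are simplified or omitted; this is an assumption-laden
# illustration, not the paper's exact figure 1.

def f1_rescaled_perceptron(candidate_feats, candidate_f1, epochs=1):
    # candidate_feats[j][k]: feature vector phi(x_j, y_jk); index 0 = best-F1 tree.
    dim = len(candidate_feats[0][0])
    w = np.zeros(dim)
    for _ in range(epochs):
        for feats, f1s in zip(candidate_feats, candidate_f1):
            scores = [w @ phi for phi in feats]
            k_pred = int(np.argmax(scores))
            if k_pred != 0:                      # predicted tree is not the best one
                loss = f1s[0] - f1s[k_pred]      # Delta(y_jk, y_j1)
                w += loss * (feats[0] - feats[k_pred])
    return w

# Two toy sentences, three candidates each (index 0 has the highest F1).
feats = [
    [np.array([1.0, 0.0]), np.array([0.2, 0.9]), np.array([0.1, 0.4])],
    [np.array([0.8, 0.1]), np.array([0.9, 0.8]), np.array([0.0, 0.3])],
]
f1 = [[0.95, 0.80, 0.70], [0.92, 0.85, 0.60]]
print(f1_rescaled_perceptron(feats, f1))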
<Paragraph position="4"> When applying kernels with a large training corpus, we face efficiency issues because of the large number of neural network weights. Even though we use only the output layer weights, this vector grows with the size of the vocabulary, and thus can be large. The kernels presented in section 2 all lead to feature vectors without many zero values. This happens because we compute the derivative of the normalization factor used in the network's estimation of P(di|d1,...,di−1). This normalization factor depends on the output layer weights corresponding to all the possible next decisions (see section 3.2). This makes an application of the VP algorithm infeasible in the case of a large vocabulary.</Paragraph> <Paragraph position="5"> We can address this problem by freezing the normalization factor when computing the feature vector. Note that we can rewrite the model log-probability of the tree as:</Paragraph> <Paragraph position="6"> log P(d1,...,dm|θ) = Σ_i θdi · h(d1,...,di−1) − Σ_i log Σ_{t ∈ N(di−1)} exp( θt · h(d1,...,di−1) ).</Paragraph> <Paragraph position="7"> We treat the parameters used to compute the first term as different from the parameters used to compute the second term, and we define our kernel only using the parameters in the first term. This means that the second term does not affect the derivatives in the formula for the feature vector φ(x,y). Thus the feature vector for the kernel will contain non-zero entries only in the components corresponding to the parser actions which are present in the candidate derivation for the sentence, as well as in the first vector component. We have applied this technique to the TOP reranking kernel, the result of which we will call the efficient TOP reranking kernel.</Paragraph> </Section>
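To illustrate why freezing the normalization term yields sparse feature vectors, here is a schematic sketch: with the second term's parameters held fixed, only output weights of actions that actually occur in a candidate derivation receive non-zero derivative components, so feature vectors can be stored and compared sparsely. The derivations, hidden vectors, and the exact composition of the feature vector below are illustrative assumptions, not the paper's exact construction.

import numpy as np

# Schematic sparse features for the "efficient TOP reranking kernel" idea:
# with the log-normalization term frozen, the derivative of
# sum_i theta_{d_i} . h_i w.r.t. theta_t is non-zero only for actions t
# appearing in the derivation.  Derivations and hidden vectors are made up.

def sparse_features(v_value, derivation):
    # derivation: list of (action, hidden_vector) pairs for one candidate parse.
    feats = {}
    for action, h in derivation:
        feats[action] = feats.get(action, np.zeros_like(h)) + h
    return v_value, feats

def efficient_top_kernel(cand_a, cand_b):
    # Inner product of two sparse feature vectors, plus the v components.
    v_a, f_a = cand_a
    v_b, f_b = cand_b
    shared = set(f_a) & set(f_b)   # only actions common to both derivations contribute
    return v_a * v_b + sum(float(f_a[t] @ f_b[t]) for t in shared)

def hv(*xs):
    return np.array(xs)

cand1 = sparse_features(0.7, [("attach", hv(0.1, 0.3)), ("predict-the", hv(0.2, 0.0))])
cand2 = sparse_features(0.4, [("attach", hv(0.5, 0.1)), ("introduce-NP", hv(0.3, 0.3))])
print(efficient_top_kernel(cand1, cand2))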
<Section position="6" start_page="184" end_page="185" type="metho"> <SectionTitle> 5 The Experimental Results </SectionTitle> <Paragraph position="0"> We used the Penn Treebank WSJ corpus (Marcus et al., 1993) to perform empirical experiments on the proposed parsing models. In each case the input to the network is a sequence of tag-word pairs. We report results for two different vocabulary sizes, varying in the frequency with which tag-word pairs must occur in the training set in order to be included explicitly in the vocabulary. A frequency threshold of 200 resulted in a vocabulary of 508 tag-word pairs (including tag-unknown word pairs), and a threshold of 20 resulted in 4215 tag-word pairs. We denote the probabilistic model trained with the vocabulary of 508 by SSN-Freq≥200, and the model trained with the vocabulary of 4215 by SSN-Freq≥20.</Paragraph> <Paragraph position="1"> Testing the probabilistic parser requires a beam search through the space of possible parses. We used a form of beam search which prunes the search after the prediction of each word. We set the width of this post-word beam to 40, both for testing of the probabilistic model and for generating the candidate list for reranking. For training and testing of the kernel models, we provided a candidate list consisting of the top 20 parses found by the generative probabilistic model. When using the Fisher kernel, we added the log-probability of the tree given by the probabilistic model as a feature. This was not necessary for the TOP kernels because they already contain a feature corresponding to the probability estimated by the probabilistic model (see section 2.3). We trained the VP model with all three kernels using the 508 word vocabulary (Fisher-Freq≥200, TOP-Freq≥200, TOP-Eff-Freq≥200), but only the efficient TOP reranking kernel model was trained with the vocabulary of 4215 words (TOP-Eff-Freq≥20). The non-sparsity of the feature vectors for the other kernels led to excessive memory requirements and longer testing times. In each case, the VP model was run for only one epoch. We would expect some improvement from running it for more epochs, as has been empirically demonstrated in other domains (Freund and Schapire, 1998).</Paragraph> <Paragraph position="2"> To avoid repeated testing on the standard testing set, we first compare the different models on their performance on the validation set. Note that the validation set was not used during learning of the kernel models or for adjustment of any parameters.</Paragraph> <Paragraph position="3"> Standard measures of accuracy are shown in table 1. All our results are computed with the evalb program following the standard criteria in (Collins, 1999), using the standard training (sections 2-22, 39,832 sentences, 910,196 words), validation (section 24, 1,346 sentences, 31,507 words), and testing (section 23, 2,416 sentences, 54,268 words) sets (Collins, 1999).</Paragraph> <Paragraph position="4"> [Table 1: labeled constituent recall (LR), precision (LP), and a combination of both (F_β=1) on validation set sentences of length at most 100.]</Paragraph> <Paragraph position="5"> Both the Fisher kernel and the TOP kernels show better accuracy than the baseline probabilistic model, but only the improvement of the TOP kernels is statistically significant, as measured with the randomized significance test of (Yeh, 2000). For the TOP kernel, the improvement over the baseline is about the same with both vocabulary sizes. Also note that the performance of the efficient TOP reranking kernel is the same as that of the original TOP reranking kernel for the smaller vocabulary.</Paragraph> <Paragraph position="6"> For comparison to previous results, table 2 lists the results on the testing set for our best model (TOP-Eff-Freq≥20) and several other statistical parsers (Collins, 1999; Collins and Duffy, 2002; Collins and Roark, 2004; Henderson, 2003; Charniak, 2000; Collins, 2000; Shen and Joshi, 2004; Shen et al., 2003; Henderson, 2004; Bod, 2003).</Paragraph> <Paragraph position="7"> [Table 2: labeled constituent recall (LR), precision (LP), and a combination of both (F_β=1) on the entire testing set.]</Paragraph> <Paragraph position="8"> First note that the parser based on the efficient TOP kernel has better accuracy than (Henderson, 2003), which used the same parsing method as our baseline model, although the trained network parameters were not the same. When compared to other kernel methods, our approach performs better than those based on the Tree kernel (Collins and Duffy, 2002; Collins and Roark, 2004), and is only 0.2% worse than the best results achieved by a kernel method for parsing (Shen et al., 2003; Shen and Joshi, 2004).</Paragraph> </Section> <Section position="7" start_page="185" end_page="186" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> The first application of kernel methods to parsing was proposed by Collins and Duffy (2002). They used the Tree kernel, where the features of a tree are all its connected tree fragments. The VP algorithm was applied to rerank the output of a probabilistic model and demonstrated an improvement over the baseline.</Paragraph> <Paragraph position="1"> Shen and Joshi (2003) applied an SVM based voting algorithm with the Preference kernel defined over pairs of candidates for reranking. To define the Preference kernel they used the Tree kernel and the Linear kernel as its underlying kernels, and achieved state-of-the-art results with the Linear kernel. In (Shen et al., 2003) it was pointed out that most of the arbitrary tree fragments allowed by the Tree kernel are linguistically meaningless.
The authors suggested the use of Lexical Tree Adjoining Grammar (LTAG) based features as a more linguistically appropriate set of features. They empirically demonstrated that incorporating these features helps to improve reranking performance.</Paragraph> <Paragraph position="2"> Shen and Joshi (2004) proposed to improve margin-based methods for reranking by defining the margin not only between the top tree and all the other trees in the candidate list but also between all pairs of parses in the ordered candidate list for the given sentence. They achieved the best results when training with an uneven margin scaled by a heuristic function of the candidates' positions in the list. One potential drawback of this method is that it does not take into account the actual F1 score of a candidate and considers only its position in the list ordered by the F1 score. We expect that an improvement could be achieved by combining our approach of scaling updates by the F1 loss with the all-pairs approach of (Shen and Joshi, 2004). Use of the F1 loss function during training has been shown to give better performance than the 0-1 loss function when applied to a structured classification task (Tsochantaridis et al., 2004).</Paragraph> <Paragraph position="3"> All the described kernel methods are limited to reranking candidates from an existing parser because of the complexity of finding the best parse given a kernel (i.e. the decoding problem). Taskar et al. (2004) suggested a method for maximal margin parsing which employs a dynamic programming approach to the decoding and parameter estimation problems. The efficiency of dynamic programming means that the entire space of parses can be considered, not just a candidate list. However, not all kernels are suitable for this method. The dynamic programming approach requires the feature vector of a tree to be decomposable into a sum over parts of the tree. In particular, this is impossible with the TOP and Fisher kernels derived from the SSN model. Also, it is not clear whether the algorithm remains tractable for a large training set with long sentences, since the authors only present results for sentences of length less than or equal to 15.</Paragraph> </Section> </Paper>