<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0402"> <Title>An SVM Based Voting Algorithm with Application to Parse Reranking</Title> <Section position="2" start_page="0" end_page="2" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Support Vector Machines (SVMs) have been successfully used in many machine learning tasks. Unlike the error-driven algorithms, SVMs search for the hyperplane that separates a set of training samples that contain two distinct classes and maximizes the margin between these two classes. The ability to maximize the margin is believed to be the reason for SVMs' superiority over other classifiers. In addition, SVMs can achieve high performance even with input data of high dimensional feature space, especially because of the use of the &quot;kernel trick&quot;. However, the incorporation of SVMs into sequential models remains a problem. An obvious reason is that the output of an SVM is the distance to the separating hyperplane, but not a probability. A possible solution to this problem is to map SVMs' results into probabilities through a Sigmoid function, and use Viterbi search to combine those probabilities (Platt, 1999). However, this approach conflicts with SVMs' purpose of achieving the so-called global optimization1. First, this approach may constrain SVMs to local features because of the left-to-right scanning strategy. Furthermore, like other nongenerative Markov models, it suffers from the so-called 1By global we mean the use of quadratic optimization in margin maximization.</Paragraph> <Paragraph position="1"> label bias problem, which means that the transitions leaving a given state compete only against each other, rather than against all other transitions in the model (Lafferty et al., 2001). Intuitively, it is the local normalization that results in the label bias problem.</Paragraph> <Paragraph position="2"> One way of using discriminative machine learning algorithms in sequential models is to rerank the n-best outputs of a generative system. Reranking uses global features as well as local features, and does not make local normalization. If the output set is large enough, the reranking approach may help to alleviate the impact of the label bias problem, because the victim parses (i.e.</Paragraph> <Paragraph position="3"> those parses which get penalized due to the label bias problem) will have a chance to take part in the reranking. null In recent years, reranking techniques have been successfully applied to the so-called history-based models (Black et al., 1993), especially to parsing (Collins, 2000; Collins and Duffy, 2002). In a history-based model, the current decision depends on the decisions made previously. Therefore, we may regard parsing as a special form of sequential model without losing generality.</Paragraph> <Paragraph position="4"> Collins (2000) has proposed two reranking algorithms to rerank the output of an existing parser (Collins, 1999, Model 2). One is based on Markov Random Fields, and the other is based on a boosting approach. In (Collins and Duffy, 2002), the use of Voted Perceptron (VP) (Freund and Schapire, 1999) for the parse reranking problem has been described. 
<Paragraph position="5"> In this paper we will follow the reranking approach.</Paragraph>
<Paragraph position="6"> We describe a novel SVM-based voting algorithm for reranking. It provides an alternative way of using a large margin classifier for sequential models. Instead of using a parse tree itself as a training sample, we use a pair of parse trees as a sample, which is analogous to the preference relation used in the context of ordinal regression (Herbrich et al., 2000). Furthermore, we justify the algorithm through a modification of the proof of the large margin rank boundaries for ordinal regression. We then apply this algorithm to the parse reranking problem.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 1.1 A Short Introduction to SVMs </SectionTitle>
<Paragraph position="0"> In this section, we give a short introduction to Support Vector Machines, following the definition in (Vapnik, 1998). For each training sample $(y_i, x_i)$, $y_i$ represents its class and $x_i$ represents its input vector defined on a $d$-dimensional space. Suppose the training samples $\{(y_1,x_1),\ldots,(y_n,x_n)\}$ ($x_i \in R^d$, $y_i \in \{-1,+1\}$) can be separated by a hyperplane $H: (x \cdot w) + b = 0$, which means</Paragraph>
<Paragraph position="1"> $$(x_i \cdot w) + b \ge +1 \;\;\text{if } y_i = +1, \qquad (x_i \cdot w) + b \le -1 \;\;\text{if } y_i = -1, \quad (1)$$</Paragraph>
<Paragraph position="2"> where $w$ is normal to the hyperplane. Training an SVM is equivalent to searching for the optimal separating hyperplane, i.e. the hyperplane that separates the training data without error and maximizes the margin between the two classes of samples. It can be shown that maximizing the margin is equivalent to minimizing $\|w\|^2$.</Paragraph>
<Paragraph position="3"> In order to handle linearly non-separable cases, we introduce a positive slack variable $\xi_i$ for each sample $(y_i, x_i)$. Training can then be reduced to the following QP problem: maximize</Paragraph>
<Paragraph position="4"> $$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad (2)$$</Paragraph>
<Paragraph position="5"> subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$, where the $\alpha_i$ ($i = 1,\ldots,l$) are the Lagrange multipliers, $l$ is the total number of training samples, and $C$ is a weighting parameter for misclassification.</Paragraph>
<Paragraph position="6"> Since linearly non-separable samples may become separable in a higher-dimensional space, SVMs employ the &quot;kernel trick&quot; to implicitly separate training samples in a high-dimensional feature space. Let $\Phi : R^d \mapsto R^h$ be a function that maps a $d$-dimensional input vector $x$ to an $h$-dimensional feature vector $\Phi(x)$. In order to search for the optimal separating hyperplane in the higher-dimensional feature space, we only need to substitute $\Phi(x_i) \cdot \Phi(x_j)$ for $x_i \cdot x_j$ in formula (2).</Paragraph>
<Paragraph position="7"> If there is a function $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, we do not need to compute $\Phi(x_i)$ explicitly. $K$ is called a kernel function. Thus during the training phase we need to solve the following QP problem:</Paragraph>
<Paragraph position="8"> $$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad (3)$$</Paragraph>
<Paragraph position="9"> subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.</Paragraph>
<Paragraph position="10"> Let $x$ be a test vector; the decision function is</Paragraph>
<Paragraph position="11"> $$f(x) = \mathrm{sgn}\Big(\sum_{j=1}^{N_s} \alpha_j y_j K(s_j, x) + b\Big), \quad (4)$$</Paragraph>
<Paragraph position="12"> where $s_j$ is a training vector whose corresponding Lagrange multiplier $\alpha_j > 0$. Such an $s_j$ is called a support vector.</Paragraph>
<Paragraph position="13"> $N_s$ is the total number of support vectors. According to (4), the decision function depends only on the support vectors.</Paragraph>
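To make the kernelized decision function in (4) concrete, here is a minimal Python sketch that evaluates $f(x)$ from a given set of support vectors and Lagrange multipliers. The RBF kernel, the toy data, and the assumption that the dual variables come from some external QP solver are illustrative choices only, not part of this paper.

    # Minimal sketch of the decision function in equation (4), assuming the
    # dual variables alpha_j, labels y_j, support vectors s_j and bias b have
    # already been obtained from some QP solver (not shown here).
    import numpy as np

    def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float = 0.5) -> float:
        """One possible kernel K(a, b); any well-defined kernel could be used."""
        return float(np.exp(-gamma * np.sum((a - b) ** 2)))

    def decision(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
        """f(x) = sgn( sum_j alpha_j * y_j * K(s_j, x) + b )."""
        score = sum(a * y * kernel(s, x)
                    for a, y, s in zip(alphas, labels, support_vectors))
        return np.sign(score + bias)

    # Toy usage with made-up support vectors and multipliers.
    svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
    print(decision(np.array([0.8, 1.2]), svs, alphas=[0.7, 0.7],
                   labels=[+1, -1], bias=0.0))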
<Paragraph position="14"> It is worth noting that not every function $K$ can be used as a kernel. We call a function $K : R^d \times R^d \mapsto R$ a well-defined kernel if and only if there is a mapping function $\Phi : R^d \mapsto R^h$ such that, for any $x_i, x_j \in R^d$, $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. One way of testing whether a function is a well-defined kernel is to use Mercer's theorem (Vapnik, 1998), which relies on the positive semidefiniteness property. For a discrete kernel, however, there is often a more convenient way: one shows that $K$ is a kernel by exhibiting the corresponding mapping function $\Phi$ explicitly. This method was used in the proofs of the string subsequence kernel (Cristianini and Shawe-Taylor, 2000) and the tree kernel (Collins and Duffy, 2001).</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="2" type="sub_section">
<SectionTitle> 1.2 Large Margin Classifiers </SectionTitle>
<Paragraph position="0"> SVMs are called large margin classifiers because they search for the hyperplane that maximizes the margin. The validity of the large margin method is guaranteed by the theorems of Structural Risk Minimization (SRM) under the Probably Approximately Correct (PAC) framework (the theoretical bounds on SVM accuracy are much weaker than the accuracy observed in practice; the ability to maximize the margin is believed to be the reason for SVMs' superiority over other classifiers): the test error is related to the training error, the number of training samples, and the capacity of the learning machine (Smola et al., 2000).</Paragraph>
<Paragraph position="1"> The Vapnik-Chervonenkis (VC) dimension (Vapnik, 1999), as well as some other measures, is used to estimate the complexity of the hypothesis space, i.e. the capacity of the learning machine. The drawback of the VC dimension is that it ignores the structure of the mapping from training samples to hypotheses and concentrates solely on the range of the possible outputs of the learning machine (Smola et al., 2000). In this paper we will use another measure, the so-called fat shattering dimension (Shawe-Taylor et al., 1998), which has been shown to be more accurate than the VC dimension (Smola et al., 2000), to justify our voting algorithm. Let $F$ be a family of hypothesis functions. The fat shattering dimension of $F$ is a function from a margin $r$ to the maximum number of samples such that any subset of these samples can be classified with margin $r$ by a function in $F$. An upper bound on the expected error is given in Theorem 1 below (Shawe-Taylor et al., 1998). We will use this theorem to justify the new voting algorithm.</Paragraph>
<Paragraph position="2"> Theorem 1 Consider a real-valued function class $F$ having fat-shattering function bounded above by the function $\mathrm{afat} : R \to N$ which is continuous from the right. Fix $\theta \in R$. If a learner correctly classifies $m$ independently generated examples $z$ with $h = T_\theta(f) \in T_\theta(F)$ such that $\mathrm{er}_z(h) = 0$ and $r = \min_i |f(x_i) - \theta|$, then with confidence $1 - \delta$ the expected error of $h$ is bounded from above by</Paragraph>
<Paragraph position="3"> $$\epsilon(m, k, \delta) = \frac{2}{m}\left(k \log_2\frac{8em}{k}\,\log_2(32m) + \log_2\frac{8m}{\delta}\right), \qquad \text{where } k = \mathrm{afat}(r/8) \le em.$$</Paragraph>
</Section>
</Section>
</Paper>
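As a small illustration of the positive semidefiniteness property mentioned in Section 1.1, the following Python sketch numerically checks that a candidate kernel's Gram matrix on a finite sample has no negative eigenvalues. This is only a sanity check on one sample, not a proof that the kernel is well defined (Mercer's condition concerns all finite samples), and the kernel and data are made up for illustration.

    # Illustrative sanity check (not a proof) related to the Mercer condition:
    # a well-defined kernel must yield a positive semidefinite Gram matrix on
    # any finite sample. The kernel choice and data below are hypothetical.
    import numpy as np

    def candidate_kernel(a: np.ndarray, b: np.ndarray) -> float:
        """A candidate kernel; here (1 + a.b)^2, the inhomogeneous polynomial kernel."""
        return (1.0 + float(np.dot(a, b))) ** 2

    def is_psd_on_sample(kernel, xs, tol: float = 1e-9) -> bool:
        """Build the Gram matrix K[i, j] = kernel(x_i, x_j) and check that all
        eigenvalues are (numerically) non-negative."""
        gram = np.array([[kernel(a, b) for b in xs] for a in xs])
        eigvals = np.linalg.eigvalsh(gram)       # Gram matrix is symmetric
        return bool(np.all(eigvals >= -tol))

    xs = [np.random.randn(3) for _ in range(20)]
    print(is_psd_on_sample(candidate_kernel, xs))   # expected: True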