<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1034">
  <Title>New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Feature-Vector Representations of Parse
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Trees and Tagged Sequences
</SectionTitle>
      <Paragraph position="0"> This paper focuses on the task of choosing the correct parse or tag sequence for a sentence from a group of &amp;quot;candidates&amp;quot; for that sentence. The candidates might be enumerated by a number of methods.</Paragraph>
      <Paragraph position="1"> The experiments in this paper use the top a0 candidates from a baseline probabilistic model: the model of (Collins 1999) for parsing, and a maximum-entropy tagger for named-entity recognition.</Paragraph>
      <Paragraph position="2"> 2i.e., polynomial in the number of training examples, and the size of trees or sentences in training and test data.</Paragraph>
      <Paragraph position="3"> Computational Linguistics (ACL), Philadelphia, July 2002, pp. 263-270. Proceedings of the 40th Annual Meeting of the Association for The choice of representation is central: what features should be used as evidence in choosing between candidates? We will use a function a1a3a2a5a4a7a6a9a8 a10a12a11 to denote a a13 -dimensional feature vector that represents a tree or tagged sequence a4 . There are many possibilities for a1a14a2a5a4a7a6 . An obvious example for parse trees is to have one component of a1a3a2a5a4a7a6 for each rule in a context-free grammar that underlies the trees. This is the representation used by Stochastic Context-Free Grammars. The feature vector tracks the counts of rules in the tree a4 , thus encoding the sufficient statistics for the SCFG.</Paragraph>
      <Paragraph position="4"> Given a representation, and two structures a4 and a15 , the inner product between the structures can be defined as  The idea of inner products between feature vectors is central to learning algorithms such as Support Vector Machines (SVMs), and is also central to the ideas in this paper. Intuitively, the inner product is a similarity measure between objects: structures with similar feature vectors will have high values for a1a3a2a5a4a7a6a27a16a28a1a14a2  a6 . More formally, it has been observed that many algorithms can be implemented using inner products between training examples alone, without direct access to the feature vectors themselves. As we will see in this paper, this can be crucial for the efficiency of learning with certain representations.</Paragraph>
      <Paragraph position="5"> Following the SVM literature, we call a function</Paragraph>
      <Paragraph position="7"> a15 a &amp;quot;kernel&amp;quot; if it can be shown that a29 is an inner product in some feature space a1 .</Paragraph>
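      <Paragraph> As an illustration only (not part of the original paper), the sketch below computes the inner product $h(x) \cdot h(y)$ for the sparse rule-count representation just described; the rule counts in the example are made up.</Paragraph>

```python
from collections import Counter

def inner_product(hx: Counter, hy: Counter) -> float:
    """Compute h(x) . h(y) for two sparse feature-count vectors.

    Iterating over the smaller vector keeps the cost proportional to the
    number of non-zero features rather than the full dimensionality d.
    """
    if len(hx) > len(hy):
        hx, hy = hy, hx
    return float(sum(count * hy.get(feat, 0) for feat, count in hx.items()))

# Illustrative use with a made-up rule-count representation of two parse trees:
h_x = Counter({"S -> NP VP": 1, "NP -> D N": 2, "VP -> V NP": 1})
h_y = Counter({"S -> NP VP": 1, "NP -> D N": 1, "VP -> V": 1})
print(inner_product(h_x, h_y))  # 1*1 + 2*1 = 3.0
```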
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Notation
</SectionTitle>
      <Paragraph position="0"> This section formalizes the idea of linear models for parsing or tagging. The method is related to the boosting approach to ranking problems (Freund et al. 1998), the Markov Random Field methods of (Johnson et al. 1999), and the boosting approaches for parsing in (Collins 2000). The set-up is as follows:a31 null Training data is a set of example input/output pairs. In parsing the training examples are a32a18a33</Paragraph>
      <Paragraph position="2"> is a sentence and each a34  is the correct tree for that sentence.a31 We assume some way of enumerating a set of candidates for a particular sentence. We use a4 a22a39a38 to denote the a40 'th candidate for the a41 'th sentence in training data, and a42 a2 a33  a6 in the space a10a12a11 . The parameters of the model are also a vector a49 a8 a10a12a11 . The output of the model on a training or test example a33 is</Paragraph>
      <Paragraph position="4"> The key question, having defined a representation a1 , is how to set the parameters a49 . We discuss one method for setting the weights, the perceptron algorithm, in the next section.</Paragraph>
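      <Paragraph> A minimal sketch of this set-up (illustrative names, not taken from the paper): candidates are sparse feature vectors, and the decision rule is $F(s) = \arg\max_{x \in C(s)} \mathbf{w} \cdot h(x)$.</Paragraph>

```python
from collections import Counter
from typing import Dict, List

def score(w: Dict[str, float], hx: Counter) -> float:
    """Linear score w . h(x), summing over the non-zero features of h(x)."""
    return sum(count * w.get(feat, 0.0) for feat, count in hx.items())

def predict(w: Dict[str, float], candidates: List[Counter]) -> int:
    """F(s): return the index of the highest-scoring candidate in C(s)."""
    return max(range(len(candidates)), key=lambda j: score(w, candidates[j]))
```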
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The Perceptron Algorithm
</SectionTitle>
      <Paragraph position="0"> Figure 1(a) shows the perceptron algorithm applied to the ranking task. The method assumes a training set as described in section 3.1, and a representation a1 of parse trees. The algorithm maintains a parameter vector a49 , which is initially set to be all zeros. The algorithm then makes a pass over the training set, only updating the parameter vector when a mistake is made on an example. The parameter vector update is very simple, involving adding the difference of the offending examples' representations</Paragraph>
      <Paragraph position="2"> a6 in the figure). Intuitively, this update has the effect of increasing the parameter values for features in the correct tree, and downweighting the parameter values for features in the competitor.</Paragraph>
      <Paragraph position="3"> See (Cristianini and Shawe-Taylor 2000) for discussion of the perceptron algorithm, including an overview of various theorems justifying this way of setting the parameters. Briefly, the perceptron algorithm is guaranteed3 to find a hyperplane that correctly classifies all training points, if such a hyper-plane exists (i.e., the data is &amp;quot;separable&amp;quot;). Moreover, the number of mistakes made will be low, providing that the data is separable with &amp;quot;large margin&amp;quot;, and 3To find such a hyperplane the algorithm must be run over the training set repeatedly until no mistakes are made. The algorithm in figure 1 includes just a single pass over the training set.</Paragraph>
      <Paragraph position="4">  this translates to guarantees about how the method generalizes to test examples. (Freund &amp; Schapire 1999) give theorems showing that the voted perceptron (a variant described below) generalizes well even given non-separable data.</Paragraph>
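      <Paragraph> A minimal single-pass sketch in the spirit of figure 1(a) (the figure itself is not reproduced in this extract); the convention that the correct parse sits at index 0 of each candidate list is an assumption of the sketch.</Paragraph>

```python
from collections import Counter
from typing import Dict, List

def train_perceptron(train_set: List[List[Counter]]) -> Dict[str, float]:
    """One pass of the ranking perceptron over the training set.

    Each element of train_set is the candidate list C(s_i) for one training
    sentence, with the correct parse x_{i,1} stored at index 0 (an assumed
    convention). Returns the learned sparse weight vector w.
    """
    def score(w: Dict[str, float], hx: Counter) -> float:
        return sum(c * w.get(f, 0.0) for f, c in hx.items())

    w: Dict[str, float] = {}
    for candidates in train_set:
        # Find the highest-scoring candidate under the current parameters.
        j = max(range(len(candidates)), key=lambda k: score(w, candidates[k]))
        if j != 0:  # mistake: w = w + h(x_{i,1}) - h(x_{i,j})
            for f, c in candidates[0].items():
                w[f] = w.get(f, 0.0) + c
            for f, c in candidates[j].items():
                w[f] = w.get(f, 0.0) - c
    return w
```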
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 The Algorithm in Dual Form
</SectionTitle>
      <Paragraph position="0"> Figure 1(b) shows an equivalent algorithm to the perceptron, an algorithm which we will call the &amp;quot;dual form&amp;quot; of the perceptron. The dual-form algorithm does not store a parameter vector a49 , instead storing a set of dual parameters, a81  . The score for a parse a4 is defined by the dual parameters as</Paragraph>
      <Paragraph position="2"> This is in contrast to a76 a2a5a4a7a6a3a19 a49 a16a52a1a14a2a5a4a7a6 , the score in the original algorithm.</Paragraph>
      <Paragraph position="3"> In spite of these differences the algorithms give identical results on training and test examples: to see this, it can be verified that a49 a19</Paragraph>
      <Paragraph position="5"> The important difference between the algorithms lies in the analysis of their computational complexity. Say a103 is the size of the training set, i.e.,</Paragraph>
      <Paragraph position="7"> . Also, take a13 to be the dimensionality of the parameter vector a49 . Then the algorithm in figure 1(a) takes a104 a2 a103a105a13 a6 time.4 This follows because a76 a2a5a4a7a6 must be calculated for each member of the training set, and each calculation of a76 involves</Paragraph>
      <Paragraph position="9"> a106a108a107a45a109a111a110 are sparse, then a112 can be taken to be the number of non-zero elements of a106 , assuming that it takes a113a114a107a45a112a65a110 time to add feature vectors with a113a114a107a45a112a65a110 non-zero elements, or to take inner products.</Paragraph>
      <Paragraph position="10"> inner product between two examples is a115 . The running time of the algorithm in figure 1(b) is a104 a2 a103 a0 a115 a6 . This follows because throughout the algorithm the number of non-zero dual parameters is bounded by a0 , and hence the calculation of</Paragraph>
      <Paragraph position="12"> a6 time. (Note that the dual form algorithm runs in quadratic time in the number of training examples</Paragraph>
      <Paragraph position="14"> The dual algorithm is therefore more efficient in cases where a0 a115a119a118a120a118a47a13 . This might seem unlikely to be the case - naively, it would be expected that the time to calculate the inner product a1a3a2a5a4a7a6a14a16a97a1a3a2 a15 a6 between two vectors to be at least a104 a2 a13 a6 . But it turns out that for some high-dimensional representations the inner product can be calculated in much better than a104 a2 a13 a6 time, making the dual form algorithm more efficient than the original algorithm. The dual-form algorithm goes back to (Aizerman et al. 64).</Paragraph>
      <Paragraph position="15"> See (Cristianini and Shawe-Taylor 2000) for more explanation of the algorithm.</Paragraph>
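      <Paragraph> An illustrative sketch of the dual form (not figure 1(b) itself): dual parameters are integer mistake counts indexed by (sentence, candidate), and all scoring goes through a user-supplied kernel computing $h(x) \cdot h(y)$, so explicit feature vectors are never built. The convention that the correct parse sits at index 0 is an assumption of the sketch.</Paragraph>

```python
from typing import Callable, Dict, List, Tuple

def train_dual_perceptron(
    train_set: List[list],
    kernel: Callable[[object, object], float],
) -> Dict[Tuple[int, int], int]:
    """One pass of the dual-form ranking perceptron (sketch).

    train_set[i] is the candidate list for sentence i, correct parse at
    index 0. alpha[(i, j)] counts the mistakes made on candidate x_{i,j}.
    """
    alpha: Dict[Tuple[int, int], int] = {}

    def f(x) -> float:
        # f(x) = sum_{i,j} alpha_{i,j} * (K(x_{i,1}, x) - K(x_{i,j}, x))
        return sum(a * (kernel(train_set[i][0], x) - kernel(train_set[i][j], x))
                   for (i, j), a in alpha.items())

    for i, candidates in enumerate(train_set):
        j = max(range(len(candidates)), key=lambda k: f(candidates[k]))
        if j != 0:  # mistake on sentence i, candidate j
            alpha[(i, j)] = alpha.get((i, j), 0) + 1
    return alpha
```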
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 The Voted Perceptron
</SectionTitle>
      <Paragraph position="0"> (Freund &amp; Schapire 1999) describe a refinement of the perceptron algorithm, the &amp;quot;voted perceptron&amp;quot;.</Paragraph>
      <Paragraph position="1"> They give theory which suggests that the voted perceptron is preferable in cases of noisy or unseparable data. The training phase of the algorithm is unchanged - the change is in how the method is applied to test examples. The algorithm in figure 1(b) can be considered to build a series of hypotheses a77a74a121 a2a5a4a7a6 , for  covering the man. The tree in (a) contains all of these subtrees, as well as many others.</Paragraph>
      <Paragraph position="2"> is a123a124a121 a2 a33 a6a125a19 a50a52a51a54a53a3a55a57a50a59a58 a60a97a62a18a64a67a66a28a68a37a69 a77a74a121 a2a5a4a7a6 . Thus the training algorithm can be considered to construct a sequence of a0 models, a123 a24 a46a27a46a27a46 a123 a91 . On a test sentence a33 , each of these a0 functions will return its own parse tree,</Paragraph>
      <Paragraph position="4"> the most likely tree as that which occurs most often in the set a32a59a123 a24 a2 a33 a6a127a30 a123</Paragraph>
      <Paragraph position="6"> cause of this the voted perceptron can be implemented with the same number of kernel calculations, and hence roughly the same computational complexity, as the original perceptron.</Paragraph>
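      <Paragraph> A sketch of the voting step, under the assumption (not from the paper) that training has recorded a snapshot of the accumulated mistake pairs after each training sentence; each snapshot defines one dual-form hypothesis, and the majority parse over the $n$ hypotheses is returned.</Paragraph>

```python
from collections import Counter
from typing import Callable, List, Tuple

def voted_predict(
    candidates: list,
    snapshots: List[List[Tuple[object, object]]],
    kernel: Callable[[object, object], float],
) -> int:
    """Return the candidate index chosen most often by the models F_1..F_n.

    snapshots[t] is assumed to hold the (correct, competitor) mistake pairs
    accumulated after t+1 training sentences; kernel(x, y) computes h(x) . h(y).
    """
    votes: Counter = Counter()
    for mistakes in snapshots:
        def f(x, mistakes=mistakes):
            # Dual score: sum over recorded mistakes of K(correct, x) - K(wrong, x).
            return sum(kernel(corr, x) - kernel(wrong, x) for corr, wrong in mistakes)
        j = max(range(len(candidates)), key=lambda k: f(candidates[k]))
        votes[j] += 1
    return votes.most_common(1)[0][0]
```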
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 A Tree Kernel
</SectionTitle>
    <Paragraph position="0"> We now consider a representation that tracks all sub-trees seen in training data, the representation studied extensively by (Bod 1998). See figure 2 for an example. Conceptually we begin by enumerating all tree fragments that occur in the training data a87 a46a27a46a27a46 a13 . Note that this is done only implicitly. Each tree is represented by a a13 dimensional vector where the a41 'th component counts the number of occurences of the a41 'th tree fragment. Define the func-</Paragraph>
    <Paragraph position="2"> will be huge (a given tree will have a number of sub-trees that is exponential in its size). Because of this we aim to design algorithms whose computational complexity is independent of a13 .</Paragraph>
    <Paragraph position="3"> The key to our efficient use of this representation is a dynamic programming algorithm that computes the inner product between two examples a4 a24 and a4  in polynomial (in the size of the trees involved), rather than a104 a2 a13 a6 , time. The algorithm is described in (Collins and Duffy 2001), but for completeness we repeat it here. We first define the set of nodes in trees a4 a24 and a4</Paragraph>
    <Paragraph position="5"> tively. We define the indicator function a134</Paragraph>
    <Paragraph position="7"> computation of the inner product is the following  where we define a154 a2 a0 a24 a30 a0</Paragraph>
    <Paragraph position="9"> ficiently, due to the following recursive definition:a31 If the productions at a0 a24 and a0  where a0 a160a56a2 a0 a24 a6 is the number of children of a0 a24 in the tree; because the productions at a0 a24 /a0  . The first two cases are trivially correct. The last, recursive, definition follows because a common subtree for a0 a24 and a0  can be formed by taking the production at a0 a24 /a0</Paragraph>
    <Paragraph position="11"> gether with a choice at each child of simply taking the non-terminal at that child, or any one of the common sub-trees at that child. Thus there are 5Pre-terminals are nodes directly above words in the surface string, for example the N, V, and D symbols in Figure 2.</Paragraph>
    <Paragraph position="12">  application being the conversion of Bod's model (Bod 1998) to an equivalent PCFG.) It is clear from the identity a1a3a2a5a4 a24 a6a162a16a7a1a3a2a5a4</Paragraph>
    <Paragraph position="14"> can be filled in, then summed.6 Since there will be many more tree fragments of larger size - say depth four versus depth three - it makes sense to downweight the contribution of larger tree fragments to the kernel. This can be achieved by introducing a parameter a85 a118 a165 a166 a87 , and modifying the base case and recursive case of the definitions of a154 to be re-</Paragraph>
    <Paragraph position="16"> responds to a modified kernel a1a14a2a5a4 a24 a6a43a16a17a1a3a2a5a4</Paragraph>
    <Paragraph position="18"> rules in the a41 'th fragment. This is roughly equivalent to having a prior that large sub-trees will be less useful in the learning task.</Paragraph>
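    <Paragraph> The recursion for $\Delta$ translates directly into a short dynamic program. The sketch below is illustrative only: the Node class is hypothetical, productions are compared as strings, and memoization over node pairs plays the role of filling in the $\Delta(n_1, n_2)$ matrix before summing; the lam argument implements the downweighting just described.</Paragraph>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Node:
    """Hypothetical parse-tree node: `production` is e.g. "NP -> D N";
    pre-terminals store their word in the production (e.g. "N -> man")
    and have no children in this sketch."""
    production: str
    children: List["Node"] = field(default_factory=list)

def tree_kernel(x1: Node, x2: Node, lam: float = 1.0) -> float:
    """h(x1) . h(x2) = sum over node pairs of Delta(n1, n2), lambda-weighted."""
    def nodes(root: Node) -> List[Node]:
        out, stack = [], [root]
        while stack:
            n = stack.pop()
            out.append(n)
            stack.extend(n.children)
        return out

    memo: Dict[Tuple[int, int], float] = {}

    def delta(n1: Node, n2: Node) -> float:
        if n1.production != n2.production:
            return 0.0
        key = (id(n1), id(n2))
        if key not in memo:
            # lambda factor; with no children this is exactly the base case.
            result = lam
            for c1, c2 in zip(n1.children, n2.children):
                result *= 1.0 + delta(c1, c2)  # recursive case over children
            memo[key] = result
        return memo[key]

    return sum(delta(n1, n2) for n1 in nodes(x1) for n2 in nodes(x2))
```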
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 A Tagging Kernel
</SectionTitle>
    <Paragraph position="0"> The second problem we consider is tagging, where each word in a sentence is mapped to one of a finite set of tags. The tags might represent part-of-speech tags, named-entity boundaries, base noun-phrases, or other structures. In the experiments in this paper we consider named-entity recognition.</Paragraph>
    <Paragraph position="1"> 6This can be a pessimistic estimate of the runtime. A more useful characterization is that it runs in time linear in the number of members a107a45a174 a135a35a175 a174 a137 a110a12a176a43a177 a135a12a178 a177 a137 such that the productions at</Paragraph>
    <Paragraph position="3"> a137 are the same. In our data we have found the number of nodes with identical productions to be approximately linear in the size of the trees, so the running time is also close to linear in the size of the trees.</Paragraph>
    <Paragraph position="4"> A tagged sequence is a sequence of word/state pairs a4a117a19 a32a171a179 a24a150a180 a33 a24 a46a27a46a27a46 a179 a91a61a180 a33 a91</Paragraph>
    <Paragraph position="6"> is the tag for that word. The particular representation we consider is similar to the all sub-trees representation for trees. A taggedsequence &amp;quot;fragment&amp;quot; is a subgraph that contains a subsequence of state labels, where each label may or may not contain the word below it. See figure 3 for an example. Each tagged sequence is represented by a a13 dimensional vector where the a41 'th component</Paragraph>
    <Paragraph position="8"> The inner product under this representation can be calculated using dynamic programming in a very similar way to the tree algorithm. We first define the set of states in tagged sequences a4 a24 and a4  respectively. Each state has an associated label and an associated word. We define the indicator function a134</Paragraph>
    <Paragraph position="10"> are the same, and the words at a0 a24 and a0  are the same, then  There are a couple of useful modifications to this kernel. One is to introduce a parameter a85 a118 a165a186a166a155a87 which penalizes larger substructures. The recursive definitions are modfied to be a154  is the number of state labels in the a41 th fragment. Another useful modification is as follows. Define  = labeled recall/precision. CBs = average number of crossing brackets per sentence. 0 CBs, a189 CBs are the percentage of sentences with 0 or a187a159a189 crossing brackets respectively. CO99 is model 2 of (Collins 1999). VP is the voted perceptron with the tree kernel.</Paragraph>
    <Paragraph position="11">  share the same word features, 0 otherwise. For example, a190 a41a102a191  might be defined to be 1 if a179 a24 and a179  are both capitalized: in this case a190 a41a102a191  where a179 a24 , a179  are the words at a0 a24 and a0   respectively. This inner product implicitly includes features which track word features, and thus can make better use of sparse data.</Paragraph>
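    <Paragraph> Since the exact recursion is not legible in this extract, the following is only one plausible reading, assuming a fragment is a contiguous run of state labels in which each label optionally keeps its word: matching labels contribute a factor for the word choices (2 if sim fires on the two words, else 1), times one plus the count for continuing at the next pair of states, with $\lambda$ downweighting longer fragments.</Paragraph>

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class State:
    """One position of a tagged sequence: a state label and its word."""
    label: str
    word: str

def tagging_kernel(x1: List[State], x2: List[State],
                   sim: Callable[[str, str], int] = lambda w1, w2: int(w1 == w2),
                   lam: float = 1.0) -> float:
    """Lambda-weighted count of common tagged-sequence fragments (a sketch).

    Assumes fragments are contiguous label runs, each label optionally paired
    with its word; `sim` generalizes exact word match (e.g. both capitalized).
    This is an illustrative guess, not the paper's exact definition.
    """
    n1, n2 = len(x1), len(x2)
    # delta[i][j] = weighted count of common fragments whose left-most states
    # are x1[i] and x2[j]; the extra row/column handles the end of a sequence.
    delta = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 - 1, -1, -1):
        for j in range(n2 - 1, -1, -1):
            if x1[i].label != x2[j].label:
                continue
            word_choices = 1 + sim(x1[i].word, x2[j].word)  # label alone, or label+word
            delta[i][j] = lam * word_choices * (1.0 + delta[i + 1][j + 1])
    return sum(delta[i][j] for i in range(n1) for j in range(n2))
```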
  </Section>
class="xml-element"></Paper>