<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3222"> <Title>The Leaf Projection Path View of Parse Trees: Exploring String Kernels for HPSG Parse Selection</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Leaf Projection Paths View of Parse </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Trees 2.1 Representing HPSG Signs </SectionTitle> <Paragraph position="0"> In HPSG, sentence analyses are given in the form of HPSG signs, which are large feature structures containing information about syntactic and semantic properties of the phrases.</Paragraph> <Paragraph position="1"> As in some of the previous work on the Redwoods corpus (Toutanova et al., 2002; Toutanova and Manning, 2002), we use the derivation trees as the main representation for disambiguation. Derivation trees record the combining rule schemas of the HPSG grammar which were used to license the sign by combining initial lexical types. The derivation tree is also the fundamental data stored in the Redwoods treebank, since the full sign can be reconstructed from it by reference to the grammar. The internal nodes represent, for example, head-complement, head-specifier, and head-adjunct schemas, which were used to license larger signs out of component parts. A derivation tree for the in bold are head nodes for the leaf word and the rest are non-head nodes.</Paragraph> <Paragraph position="2"> sentence Let us plan on that is shown in Figure 1. 2 Additionally, we annotate the nodes of the derivation trees with information extracted from the HPSG sign. The annotation of nodes is performed by extracting values of feature paths from the feature structure or by propagating information from children or parents of a node. In theory with enough annotation at the nodes of the derivation trees, we can recover the whole HPSG signs.</Paragraph> <Paragraph position="3"> Here we describe three node annotations that proved very useful for disambiguation. One is annotation with the values of the feature path synsem.local.cat.head - its values are basic parts of speech such as noun, verb, prep, adj, adv. Another is phrase structure category information associated with the nodes, which summarizes the values of several feature paths and is available in the Redwoods corpus as Phrase-Structure trees. The third is annotation with lexical type (le-type), which is the type of the head word at a node. The preterminals in Figure 1 are lexical item identifiers -- identifiers of the lexical entries used to construct the parse. The le-types are about a0a2a1a3a1 types in the HPSG type hierarchy and are the direct super-types of the lexical item identifiers. The le-types are not shown in this figure, but can be seen at the leaves in Figure 2. For example, the lexical type of LET V1 in the figure is v sorb. In Figure 1, the only annotation performed is with the values of synsem.local.cat.head.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The Leaf Projection Paths View </SectionTitle> <Paragraph position="0"> The projection path of a leaf is the sequence of nodes from the leaf to the root of the tree. 
<Paragraph position="1"> In Figure 2, the leaf projection paths for three of the words are shown. We can see that a node in the derivation tree participates in the projection paths of all words dominated by that node. The original local rule configurations -- a node and its children -- do not occur jointly in the projection paths; thus, if special annotation is not performed to recover it, this information is lost.</Paragraph>
<Paragraph position="2"> As seen in Figure 2, and as is always true for a grammar that produces non-crossing lexical dependencies, there is an initial segment of the projection path for which the leaf word is a syntactic head (called the head path from here on), and a final segment for which the word is not a syntactic head (called the non-head path from here on). In HPSG, non-local dependencies are represented in the final semantic representation, but they cannot be obtained via syntactic head annotation.</Paragraph>
<Paragraph position="3"> If, in a traditional parsing model that estimates the likelihood of a local rule expansion given a node (such as, e.g., (Collins, 1997)), the tree nodes are annotated with the word of the lexical head, some of the information present in the word projection paths can be recovered. However, this is only the information in the head-path part of the projection path. In further experiments we show that the non-head part of the projection path is very helpful for disambiguation.</Paragraph>
<Paragraph position="4"> Using this representation of derivation trees, we can apply string kernels to the leaf projection paths and combine those to obtain kernels on trees. In the rest of this paper we explore the application of string kernels to this task, comparing the performance of the new models to models using more standard rule features.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle>3 Tree and String Kernels</SectionTitle>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle>3.1 Kernels and SVM ranking</SectionTitle>
<Paragraph position="0"> From a machine learning point of view, the parse selection problem can be formulated as follows: given training examples $\{(s_i, t_{i,1}, \ldots, t_{i,m_i})\}_{i=1}^{n}$, where each $s_i$ is a natural language sentence, $n$ is the number of such sentences, $t_{i,j}$ is a parse tree for $s_i$, $m_i$ is the number of parses for the sentence $s_i$, $\Phi(t_{i,j})$ is a feature representation of the parse tree $t_{i,j}$, and we are given, as training information, which of the $t_{i,j}$ is the correct parse, learn how to correctly identify the correct parse of an unseen test sentence.</Paragraph>
<Paragraph position="1"> One approach to solving this problem is to represent it as an SVM (Vapnik, 1998) ranking problem, where (without loss of generality) $t_{i,1}$ is assumed to be the correct parse for $s_i$. The goal is to learn a parameter vector $\vec{w}$ such that the score of the correct parse, $\langle \vec{w}, \Phi(t_{i,1}) \rangle$, is higher than the scores of all other parses for the sentence. We therefore optimize a soft-margin ranking objective; the $\xi_{i,j}$ are slack variables used to handle the non-separable case. The same formulation has been used in (Collins, 2001) and (Shen and Joshi, 2003).</Paragraph>
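For concreteness, the standard soft-margin ranking formulation consistent with the description above (a sketch rather than a verbatim reproduction of the paper's equation; $C$ is a regularization constant, not a value taken from this paper):

\begin{aligned}
\min_{\vec{w},\,\xi}\;\; & \tfrac{1}{2}\,\lVert \vec{w} \rVert^{2} \;+\; C \sum_{i=1}^{n} \sum_{j=2}^{m_i} \xi_{i,j} \\
\text{subject to}\;\; & \langle \vec{w}, \Phi(t_{i,1}) \rangle - \langle \vec{w}, \Phi(t_{i,j}) \rangle \;\ge\; 1 - \xi_{i,j}, \qquad \xi_{i,j} \ge 0, \quad i = 1,\ldots,n, \;\; j = 2,\ldots,m_i .
\end{aligned}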
<Paragraph position="2"> This problem can be solved by solving the dual, in which case we only need inner products of the feature vectors. This allows the use of the kernel trick, where we replace the inner product in the representation space by an inner product in some feature space, usually different from the representation space. The advantage of using a kernel lies in the computational effectiveness of computing it (it may not require performing the expensive transformation $\Phi$ explicitly). We learn SVM ranking models using a tree kernel defined via string kernels on projection paths.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle>3.2 Kernels on Trees Based on Kernels on Projection Paths</SectionTitle>
<Paragraph position="0"> So far we have defined a representation of parse trees as lists of strings corresponding to the projection paths of words. Now we formalize this representation and show how string kernels on projection paths extend to tree kernels.</Paragraph>
<Paragraph position="1"> We introduce the notion of a keyed string -- a string that has a key, which is some letter from the alphabet $\Sigma$ of the string. We denote a keyed string by a pair $(w, s)$, where $w \in \Sigma$ is the key and $s$ is the string. In our application, a key is a word $v$, and the string is the sequence of derivation tree nodes on the head or non-head part of the projection path of the word $v$. Additionally, to reduce sparsity, for each keyed string $(v, s)$ we also include a keyed string $(le_v, s)$, where $le_v$ is the le-type of the word $v$. Thus each projection path occurs twice in the list representation of the tree: once keyed by the word, and once keyed by its le-type. The strings $s$ are sequences of annotated derivation tree nodes, e.g. $s$ = "LET_V1:verb HCOMP:verb HCOMP:verb IMPER:verb" for the head projection path of let in Figure 2. The non-head projection path of let is empty.</Paragraph>
<Paragraph position="2"> For a given kernel $K$ on strings, we define its extension to keyed strings so that two keyed strings are compared only when their keys match: $K'((w_1, s_1), (w_2, s_2)) = [\![ w_1 = w_2 ]\!] \, K(s_1, s_2)$. We use this construction for all string kernels applied in this work.</Paragraph>
<Paragraph position="3"> Given a tree $T_1 = ((w_1, s_1), \ldots, (w_n, s_n))$, a tree $T_2 = ((v_1, r_1), \ldots, (v_m, r_m))$, and a kernel $K'$ on keyed strings, we define a kernel $K_{tree}$ on the trees as $K_{tree}(T_1, T_2) = \sum_{i=1}^{n} \sum_{j=1}^{m} K'((w_i, s_i), (v_j, r_j))$. This can be viewed as a convolution (Haussler, 1999), and therefore $K_{tree}$ is a valid kernel (positive definite symmetric) if $K'$ is a valid kernel.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle>3.3 String Kernels</SectionTitle>
<Paragraph position="0"> We experimented with some of the string kernels proposed in (Lodhi et al., 2000; Leslie and Kuang, 2003), which have been shown to perform very well at indicating string similarity in other domains. In particular, we applied the N-gram kernel, the Subsequence kernel, and the Wildcard kernel. We refer the reader to (Lodhi et al., 2000; Leslie and Kuang, 2003) for detailed formal definitions of these kernels and restrict ourselves to an intuitive description here. In addition, we devised a new kernel, called the Repetition kernel, which we describe in detail; all of these string kernels are lifted to trees via the construction of Section 3.2, sketched below.</Paragraph>
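A minimal sketch of that tree-kernel construction (not the authors' implementation); it assumes each tree is given as a plain list of (key, node-label-sequence) pairs, and uses a simple bi-gram counter as an illustrative base string kernel:

from collections import Counter
from typing import Callable, Sequence, Tuple

KeyedString = Tuple[str, Tuple[str, ...]]  # (key, sequence of annotated node labels)

def bigram_kernel(s1: Sequence[str], s2: Sequence[str]) -> float:
    """A simple base string kernel: counts shared bi-grams (illustrative choice)."""
    grams1 = Counter(zip(s1, s1[1:]))
    grams2 = Counter(zip(s2, s2[1:]))
    return float(sum(grams1[g] * grams2[g] for g in grams1))

def keyed_kernel(ks1: KeyedString, ks2: KeyedString,
                 base_kernel: Callable[[Sequence[str], Sequence[str]], float]) -> float:
    """Extension of a string kernel to keyed strings: non-zero only if the keys match."""
    (k1, s1), (k2, s2) = ks1, ks2
    return base_kernel(s1, s2) if k1 == k2 else 0.0

def tree_kernel(t1: Sequence[KeyedString], t2: Sequence[KeyedString],
                base_kernel: Callable[[Sequence[str], Sequence[str]], float]) -> float:
    """Convolution tree kernel: sum of keyed-string kernel values over all pairs of paths."""
    return sum(keyed_kernel(a, b, base_kernel) for a in t1 for b in t2)

# Usage: each tree is its list of projection paths, keyed by word and (duplicated) by le-type.
t1 = [("let", ("LET_V1:verb", "HCOMP:verb", "HCOMP:verb", "IMPER:verb")),
      ("v_sorb", ("LET_V1:verb", "HCOMP:verb", "HCOMP:verb", "IMPER:verb"))]
t2 = [("let", ("LET_V1:verb", "HCOMP:verb", "IMPER:verb"))]
print(tree_kernel(t1, t2, bigram_kernel))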
<Paragraph position="1"> The kernels used here can be defined as the inner product of the feature vectors of two strings, $K(s, r) = \langle \Phi(s), \Phi(r) \rangle$, with a feature map $\Phi$ from the space of all finite sequences over a string alphabet $\Sigma$ to a vector space indexed by a set of subsequences over $\Sigma$. As a simple example, the 1-gram string kernel maps each string $s \in \Sigma^{*}$ to a vector of dimensionality $|\Sigma|$, in which each element indicates the number of times the corresponding symbol from $\Sigma$ occurs in $s$.</Paragraph>
<Paragraph position="2"> The Repetition kernel is similar to the 1-gram kernel. It improves on the 1-gram kernel by better handling cases with repeated occurrences of the same symbol. Intuitively, in the context of our application, this kernel captures the tendency of words to take (or not take) repeated modifiers of the same kind. For example, it may be likely that a certain verb takes one PP-modifier, but less likely that it takes two or more.</Paragraph>
<Paragraph position="3"> More specifically, the Repetition kernel is defined such that its vector space consists of all sequences over $\Sigma$ composed of the same symbol. The feature map matches substrings of the input string to features, allowing the occurrence of gaps. There are two discount parameters, $\lambda_1$ and $\lambda_2$: $\lambda_1$ serves to discount features for the occurrence of gaps, and $\lambda_2$ discounts longer symbol sequences. Formally, for an input string $s$, the value of the feature vector for a feature index sequence $u = a \cdots a$ with $|u| = k$ is determined by the left-most minimal contiguous substring of $s$ that contains $u$ as a subsequence: the feature is discounted by $\lambda_2$ for each matched symbol and by $\lambda_1$ for each gap position inside that substring.</Paragraph>
<Paragraph position="4"> The weighted Wildcard kernel performs matching by permitting a restricted number of matches to a wildcard character. A $(k, m)$ wildcard kernel has as feature indices $k$-grams with up to $m$ wildcard characters. Any character matches a wildcard. For example, the 3-gram $aab$ matches the feature index $a{*}b$ in a (3,1) wildcard kernel. The weighting is based on the number of wildcard characters used: the weight is multiplied by a discount $\lambda$ for each wildcard.</Paragraph>
<Paragraph position="5"> The Subsequence kernel was defined in (Lodhi et al., 2000). We used a variation where the kernel is defined by two integers $(k, g)$ and two discount factors $\lambda_1$ and $\lambda_2$ for gaps and characters. A subseq(k,g) kernel has as features all $n$-grams with $n \le k$, and $g$ is a restriction on the maximal span of the $n$-gram in the original string: for example, for a 2-gram, the two letters can be at most $g - 2$ positions apart in the original string. The weight of a feature is multiplied by $\lambda_1$ for each gap and by $\lambda_2$ for each non-gap. For example, with $k = 2$ and $g = 3$, the feature index $aa$ matches a subsequence $a\,x\,a$ of the input only once, since its span must be at most 3; the match has one gap and is therefore discounted by $\lambda_1$ once and by $\lambda_2$ twice.</Paragraph>
<Paragraph position="6"> The details of the algorithms for computing the kernels can be found in the aforementioned papers (Lodhi et al., 2000; Leslie and Kuang, 2003). To summarize, the kernels can be implemented efficiently using tries.</Paragraph>
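As an illustration of the explicit feature-map view (not the trie-based implementation, and under the reading of the Repetition kernel given above: $\lambda_2$ per matched symbol and $\lambda_1$ per gap in the minimal covering window), a small sketch:

from collections import defaultdict
from typing import Dict, Sequence, Tuple

def repetition_features(s: Sequence[str], lam_gap: float = 0.5,
                        lam_sym: float = 1.0) -> Dict[Tuple[str, int], float]:
    """Explicit Repetition-kernel feature map (a sketch of one reading of the definition).

    Feature (a, k) stands for the sequence a...a of length k; its value is taken from
    the smallest window of s containing k occurrences of a: lam_sym per matched symbol
    and lam_gap per gap position inside that window.
    """
    positions = defaultdict(list)
    for i, sym in enumerate(s):
        positions[sym].append(i)
    feats: Dict[Tuple[str, int], float] = {}
    for sym, pos in positions.items():
        for k in range(1, len(pos) + 1):
            # length of the smallest window covering k occurrences of sym
            win = min(pos[j + k - 1] - pos[j] + 1 for j in range(len(pos) - k + 1))
            gaps = win - k
            feats[(sym, k)] = (lam_sym ** k) * (lam_gap ** gaps)
    return feats

def repetition_kernel(s1: Sequence[str], s2: Sequence[str], **kw) -> float:
    f1, f2 = repetition_features(s1, **kw), repetition_features(s2, **kw)
    return sum(v * f2.get(idx, 0.0) for idx, v in f1.items())

# e.g. paths that differ in how many adjuncts of the same kind are attached (labels illustrative)
print(repetition_kernel(["HADJ:verb", "HADJ:verb", "HCOMP:verb"], ["HADJ:verb", "HCOMP:verb"]))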
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle>4 Experiments</SectionTitle>
<Paragraph position="0"> In this section we describe our experimental results using different string kernels and different feature annotations of parse trees. We learn Support Vector Machine (SVM) ranking models using the software package SVMlight (Joachims, 1999). We also normalized the kernels: $\hat{K}(x, y) = K(x, y) / \sqrt{K(x, x)\,K(y, y)}$.</Paragraph>
<Paragraph position="1"> For all tree kernels implemented here, we first extract all features, generating an explicit map to the space of the kernel, and learn SVM ranking models using SVMlight with a linear kernel in that space. Since the feature maps are not especially expensive for the kernels used here, we chose to solve the problem in its primal form. We were not aware of any fast software packages that could solve SVM ranking problems in the dual formulation. It is possible to convert the ranking problem into a classification problem using pairs of trees, as shown in (Shen and Joshi, 2003); we have taken this approach in more recent work using string kernels that require very expensive feature maps.</Paragraph>
<Paragraph position="2"> We performed experiments using the version of the Redwoods corpus that was also used in the work of (Toutanova et al., 2002; Osborne and Baldridge, 2004) and others. The corpus contains several thousand annotated sentences, a subset of which are ambiguous; unambiguous sentences were excluded from the training and test sets. All models were trained and tested using 10-fold cross-validation. Accuracy results are reported as the percentage of sentences for which the correct analysis was ranked first by the model.</Paragraph>
<Paragraph position="3"> The structure of the experiments section is as follows. First, in Section 4.1, we describe the results of a controlled experiment using a limited number of features, aimed at comparing models using local rule features to models using leaf projection paths. Next, in Section 4.2, we describe models using more sophisticated string kernels on projection paths.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle>4.1 The Leaf Projection Paths View versus the Context-Free Rule View</SectionTitle>
<Paragraph position="0"> In order to evaluate the gains from the new representation, we describe the features of three similar models: one using the leaf projection paths, and two using derivation tree rules. Additionally, we train a model using only the features from the head-path parts of the projection paths, to illustrate the gain from using the non-head path. As we will show, a model using only the head paths has almost the same features as a rule-based tree model.</Paragraph>
<Paragraph position="1"> All models here use derivation tree nodes annotated with only the rule schema name (as in Figure 1) and the synsem.local.cat.head value. We define these models by their feature maps from trees to vectors. It will be convenient to define the feature maps for all models by specifying the set of features through templates.
The value $\Phi_f(t)$ for a feature $f$ and a tree $t$ is the number of times $f$ occurs in the tree. It is easy to show that the kernels on trees introduced in Section 3.2 can be defined via a feature map that is the sum of the feature maps of the string kernels on projection paths.</Paragraph>
<Paragraph position="2"> As a concrete example, for each model we show the features that contain the node [HCOMP:verb] from Figure 1, which covers the phrase plan on that (the full feature lists for the projection path and rule representations are given in the accompanying figure).</Paragraph>
<Paragraph position="3"> Bi-gram Model on Projection Paths (2PP). The features of this model use a projection path representation where the keys are not the words but the le-types of the words. The features are defined by the template $\langle letype, node_i, node_{i+1}, ishead \rangle$: $ishead$ is a binary variable indicating whether the feature matches a head or a non-head path, $letype$ is the le-type of the path leaf, and $node_i, node_{i+1}$ is a bi-gram from the path (a code sketch of this template is given below, after the model descriptions).</Paragraph>
<Paragraph position="4"> The node [HCOMP:verb] is part of the head path for plan, and part of the non-head paths for on and that. The le-types of the words let, plan, on, and that are, with abbreviations, v_sorb, v_e_p, p_reg, and n_deic_pro_sg respectively. In the examples, the node labels are abbreviated as well, and there are special symbols marking the start and the end of a path. The features that contain the node are therefore the bi-gram tuples keyed by the le-type of plan (with the head bit set) and by the le-types of on and that (with the non-head bit set).</Paragraph>
<Paragraph position="5"> Bi-gram Model on Head Projection Paths (2HeadPP). This model has a subset of the features of Model 2PP -- only those obtained from the head-path parts of the projection paths. For our example, it contains the subset of 2PP features whose head bit is 1, which are only the features from the head path of plan.</Paragraph>
<Paragraph position="6"> Rule Tree Model I (Rule I). The features of this model are derived from local rule expansions -- a node together with its children -- annotated with le-types; the last bit in the tuples indicates whether the tuple contains the le-type of the head child or of the non-head child as its first element. The features containing the node [HCOMP:verb] are the ones from the expansion at that node and also the ones from the expansion of its parent.</Paragraph>
<Paragraph position="7"> Rule Tree Model II (Rule II). This model splits the features of model Rule I into two parts, to mimic the features of the projection path models. It has features from two templates: one combining the le-type of the node's lexical head, the node label, and the head child, and one combining the le-type of the node's lexical head, the node label, and the non-head child. This model has fewer features than model Rule I, because it splits each rule into its head and non-head parts and does not have the two parts jointly. We can note that this model has all the features of 2HeadPP, except the ones involving the start and end of a path, due to the first template. The second template leads to features that are not even in 2PP, because they connect the head and non-head paths of a word, which are represented as separate strings in 2PP.</Paragraph>
<Paragraph position="8"> Overall, we can see that models Rule I and Rule II have the information used by 2HeadPP (and some more information), but do not have the information from the non-head parts of the paths in Model 2PP. Table 1 shows the average parse ranking accuracy obtained by the four models, as well as the number of features used by each model. Model Rule I did not do better than model Rule II, which shows that the joint representation of rule features was not very important.</Paragraph>
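A small sketch of the 2PP feature template (hypothetical helper, not the authors' code; the boundary symbols and node labels are illustrative):

from collections import Counter
from typing import Sequence

def pp_bigram_features(letype: str, path: Sequence[str], is_head: bool) -> Counter:
    """2PP-style features for one (head or non-head) projection path segment.

    Bi-grams are taken over the path padded with start/end-of-path symbols; each
    feature records the le-type of the leaf word and a head/non-head bit.
    """
    padded = ["<BOP>"] + list(path) + ["<EOP>"]   # placeholder boundary symbols
    feats = Counter()
    for a, b in zip(padded, padded[1:]):
        feats[(letype, a, b, int(is_head))] += 1
    return feats

# Head path of "plan" and non-head path of "on" from Figure 1 (labels illustrative).
print(pp_bigram_features("v_e_p", ["HCOMP:verb", "HCOMP:verb", "IMPER:verb"], True))
print(pp_bigram_features("p_reg", ["HCOMP:verb", "IMPER:verb"], False))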
<Paragraph position="9"> The large improvement of 2PP over 2HeadPP (13% error reduction) shows the usefulness of the non-head projection paths. The error reduction of 2PP over Rule I is also large -- 9%. Further improvements over models using rule features were possible by considering more sophisticated string kernels and word-keyed projection paths, as shown in the following sections.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle>4.2 Experimental Results using String Kernels on Projection Paths</SectionTitle>
<Paragraph position="0"> In the present experiments, we have limited the derivation tree node annotation to the features listed in Table 2 (the examples in the table are from one node in the head path of the word let in Figure 1). Many other features from the HPSG signs are potentially helpful for disambiguation, and incorporating more useful features is a next step for this work. However, given the size of the corpus, a single model cannot usefully profit from a large number of features. Previous work (Osborne and Baldridge, 2004; Toutanova and Manning, 2002; Toutanova et al., 2002) has explored combining multiple classifiers that use different features. We report results from such an experiment as well.</Paragraph>
<Paragraph position="1"> Using Node Label and Head Category Annotations. The simplest derivation tree node representation that we consider consists of two of the features in Table 2: the schema name and the category of the lexical head. All experiments in this subsection were performed using this derivation tree annotation. We briefly mention results from the best string kernels when using other node annotations, as well as a combination of models using different features, in the following subsection.</Paragraph>
<Paragraph position="2"> To evaluate the usefulness of our Repetition kernel, defined in Section 3.3, we performed several simple experiments. We compared it to a 1-gram kernel and to a 2-gram kernel. The results -- number of features per model and accuracy -- are shown in Table 3. The models in this table include both features from projection paths keyed by words and features from projection paths keyed by le-types. The results show that the Repetition kernel achieves a noticeable improvement over a 1-gram model, with the addition of only a small number of features. For most words, repeated symbols do not occur in their paths, and the Repetition kernel behaves like a 1-gram kernel in the majority of cases. The additional information it captures about repeated symbols gives a sizable improvement. The bi-gram kernel performs better, but at the cost of adding many features. It is likely that for large alphabets and small training sets, the Repetition kernel may outperform the bi-gram kernel.</Paragraph>
<Paragraph position="3"> From this point on, we fix the string kernel for projection paths keyed by words: it is a linear combination of a bi-gram kernel and a Repetition kernel (sketched below). We found that, because lexical information is sparse, going beyond 2-grams for lexically headed paths was not useful.</Paragraph>
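Since a positively weighted sum of kernels is again a kernel, the fixed word-keyed path kernel can be sketched by reusing the bigram_kernel and repetition_kernel sketches above (the mixing weights are illustrative, not values from the paper):

def word_path_kernel(s1, s2, w_bigram=1.0, w_rep=1.0):
    """Word-keyed path kernel: linear combination of bi-gram and Repetition kernels (sketch)."""
    return w_bigram * bigram_kernel(s1, s2) + w_rep * repetition_kernel(s1, s2)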
<Paragraph position="4"> The projection paths keyed by le-types are much less sparse, but they still capture important sequence information about the syntactic frames of words of particular lexical types.</Paragraph>
<Paragraph position="5"> To study the usefulness of different string kernels on projection paths, we first tested models where only le-type keyed paths were represented, and then tested the performance of the better models when word-keyed paths were added (with the fixed string kernel that interpolates a bi-gram and a Repetition kernel).</Paragraph>
<Paragraph position="6"> Table 4 shows the accuracy achieved by several string kernels, as well as the number of features (in thousands) they use, for paths keyed by le-type only or by both word and le-type. As can be seen from the table, the models are very sensitive to the discount factors used. Many of the kernels that use some combination of 1-grams and possibly discontinuous bi-grams performed at approximately the same accuracy level; such are the wildcard(2,1,$\lambda$) and subseq(2,$g$,$\lambda_1$,$\lambda_2$) kernels. Kernels that use 3-grams have many more parameters, and even though they can be marginally better when using le-types only, their advantage disappears when word-keyed paths are added. A limited amount of discontinuity in the Subsequence kernels was useful. Overall, Subsequence kernels were slightly better than Wildcard kernels. The major difference between the two kinds of kernels, as we have used them here, is that the Subsequence kernel unifies features that have gaps in different places and the Wildcard kernel does not. For example, $a{*}b$, ${*}ab$, and $ab{*}$ are different features for Wildcard, but they are the same feature $ab$ for Subsequence -- only the weighting of the feature depends on the position of the wildcard.</Paragraph>
<Paragraph position="7"> When projection paths keyed by words are added, the accuracy increases significantly. The subseq(2,3,.5,2) kernel achieved the best accuracy, much higher than the best previously published accuracy from a single model on this corpus, obtained by a model that incorporates more sources of information from the HPSG signs (Toutanova et al., 2002). The error reduction compared to that model is large, and the accuracy is also competitive with results obtained by combining multiple classifiers (Osborne and Baldridge, 2004).</Paragraph>
<Paragraph position="8"> Other Features and Model Combination. Finally, we trained several models using different derivation tree annotations and built a model that combined the scores of these models together with the best model, subseq(2,3,.5,2), from Table 4. The combined model achieved our best overall accuracy. The models combined were:</Paragraph>
<Paragraph position="9"> Model I: A model that uses the Node Label and the le-type of the non-head daughter for head projection paths, and the Node Label and synsem.local.cat.head for non-head projection paths. The model used the subseq(2,3,.5,2) kernel for le-type keyed paths and bi-gram + Repetition for word-keyed paths, as above.</Paragraph>
<Paragraph position="10"> Model II: A model that uses, for non-head paths, the PS category of the node; it uses the same kernels as Model I. Number of features: 311K.</Paragraph>
<Paragraph position="11"> Model III: A model that uses the PS label and synsem.local.cat.head for head paths, and only the PS label for non-head paths. The kernels are the same as in Model I. Number of features: 165K.</Paragraph>
<Paragraph position="12"> Model IV: A standard model based on rule features for local trees, with 2 levels of grandparent annotation and back-off.
The annotation used at the nodes was with the Node Label and synsem.local.cat.head. Number of features: 78K.</Paragraph>
</Section>
</Section>
</Paper>