<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0624">
  <Title>Sparse Bayesian Classification of Predicate Arguments</Title>
  <Section position="4" start_page="0" end_page="177" type="metho">
    <SectionTitle>
2 Sparse Bayesian Learning and the
Relevance Vector Machine
</SectionTitle>
    <Paragraph position="0"> The Sparse Bayesian method is described in detail in (Tipping, 2001). Like other generalized linear learning methods, the resulting binary classifier has the form f(x) = sign( sum_i a_i f_i(x) ),</Paragraph>
    <Paragraph position="2"> where the f_i are basis functions. Training the model then consists of finding a suitable weight vector a = (a_1, ..., a_N) given a data set (X, Y).</Paragraph>
    <Paragraph position="5"> Analogously to the SVM approach, we can let f_i(x) = k(x, x_i), where x_i is an example from the training set and k a kernel function. We have then arrived at the Relevance Vector Machine (RVM). There are, however, no restrictions on the function k (such as Mercer's condition for SVM). We use the Gaussian kernel k(x, y) = exp(-gamma ||x - y||^2) throughout this work.</Paragraph>
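To make the decision rule concrete, here is a minimal sketch (not from the original paper; the toy relevance vectors, weights, and gamma = 0.5 are illustrative assumptions) that evaluates f(x) as a weighted sum of Gaussian kernel values:

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

def rvm_decision(x, relevance_vectors, weights, gamma=0.5):
    """f(x) = sum_i a_i * k(x, x_i), summed over the retained
    (relevance) examples; the sign gives the predicted class."""
    return sum(a * gaussian_kernel(x, rv, gamma)
               for a, rv in zip(weights, relevance_vectors))

# toy model: two relevance vectors with opposite weights
rvs = [(0.0, 0.0), (2.0, 2.0)]
ws = [1.0, -1.0]
score = rvm_decision((0.1, 0.0), rvs, ws)  # close to the positive vector
```

Note that only the examples kept after training (the relevance vectors) enter the sum, which is what makes the classifier sparse at test time.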
    <Paragraph position="8"> We first model the probability of a positive example as a sigmoid applied to f(x). This can be used to write the likelihood function P(Y|X, a). Instead of a conventional ML approach (maximizing the likelihood with respect to a, which would give an overfit model), we adopt a Bayesian approach and encode the model preferences using priors on a. For each a_i, we introduce a prior p(a_i | s_i), a zero-mean Gaussian with precision s_i. This is in effect an &amp;quot;Occam penalty&amp;quot; that encodes our preference for sparse models. Finally, we should specify the distributions of the s_i; we make the simplifying assumption that their distribution is flat (noninformative).</Paragraph>
    <Paragraph position="13"> We now find the maximum of the marginal likelihood, or &amp;quot;evidence&amp;quot;, with respect to s, that is, P(Y|X, s) = integral of P(Y|X, a) p(a|s) da. This integral is not tractable; hence we approximate the integrand using a Gaussian centered at the mode of the integrand (Laplace's approximation). The marginal likelihood can then be differentiated with respect to s and maximized using iterative methods such as gradient descent.</Paragraph>
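As an illustration of the hyperparameter step, the following sketch implements the fixed-point update from Tipping (2001): gamma_i = 1 - s_i * Sigma_ii measures how well determined weight a_i is, and the new precision is gamma_i / mu_i^2, where mu and Sigma are the posterior mean and covariance delivered by the Laplace approximation. The toy numbers are assumptions for illustration only:

```python
def update_hyperparameters(s, mu, sigma_diag):
    """One fixed-point update of the precisions s_i: a well-determined
    weight (gamma_i near 1, large mu_i) keeps a small precision, while a
    redundant weight (gamma_i near 0, tiny mu_i) has its precision blow up."""
    new_s = []
    for s_i, mu_i, sig_ii in zip(s, mu, sigma_diag):
        gamma_i = 1.0 - s_i * sig_ii
        new_s.append(gamma_i / (mu_i ** 2))
    return new_s

# toy posterior: weight 0 is relevant (large mean), weight 1 is redundant
new_s = update_hyperparameters(s=[1.0, 1.0],
                               mu=[2.0, 0.01],
                               sigma_diag=[0.5, 0.99])
```

After a few such updates, the precisions of redundant weights diverge, which is exactly the behavior exploited by the pruning step described next.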
    <Paragraph position="16"> The algorithm thus proceeds iteratively as follows: first maximize the penalized likelihood function P(Y|X, a) p(a|s) with respect to a (for example via the Newton-Raphson method), then update the parameters s_i. This goes on until a convergence criterion is met, for example that the changes in the s_i are small enough. During iteration, the s_i parameters for redundant examples tend to infinity. They (and the corresponding columns of the kernel matrix) are then removed from the model. This is necessary for numerical stability, and it also reduces the training time considerably.</Paragraph>
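The pruning step can be sketched as follows; the threshold standing in for &amp;quot;infinity&amp;quot; is an assumption, and the column representation is simplified:

```python
PRUNE_THRESHOLD = 1e6  # assumed cutoff; an s_i above this counts as diverged

def prune(s, weights, kernel_columns):
    """Drop basis functions whose precision s_i has diverged; the
    corresponding weights and kernel-matrix columns are removed too,
    shrinking the model and the subsequent linear-algebra work."""
    kept = [i for i, s_i in enumerate(s) if s_i < PRUNE_THRESHOLD]
    return ([s[i] for i in kept],
            [weights[i] for i in kept],
            [kernel_columns[i] for i in kept])

# toy model: the middle example is redundant and gets pruned
new_s, new_w, new_cols = prune([2.0, 1e9, 5.0],
                               [0.3, 0.0, -0.2],
                               ["col0", "col1", "col2"])
```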
    <Paragraph position="24"> We implemented the RVM training method using the ATLAS (Whaley et al., 2000) implementation of the BLAS and LAPACK standard linear algebra APIs. To make the algorithm scale up, we used a working-set strategy that reuses the results of partial solutions to train the final classifier. Our implementation is based on the original description of the algorithm (Tipping, 2001) rather than the greedy optimized version (Tipping and Faul, 2003), since preliminary experiments suggested a decrease in classification accuracy. Our current implementation can handle training sets of up to about 30,000 examples.</Paragraph>
    <Paragraph position="25"> We used the conventional one-versus-one method for multiclass classification. Although the Sparse Bayesian paradigm is theoretically not limited to binary classifiers, this is of little use in practice, since the size of the Hessian matrix (used while maximizing the likelihood and updating s) grows with the number of classes.</Paragraph>
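The one-versus-one scheme can be sketched as below; the role labels and the stub pairwise classifiers are toy assumptions, not the system's actual models:

```python
from collections import Counter
from itertools import combinations

def ovo_predict(x, classifiers, classes):
    """One-versus-one voting: one binary classifier per class pair,
    each voting for one of its two classes; the class with the most
    votes wins. With n classes there are n*(n-1)/2 classifiers."""
    votes = Counter()
    for c1, c2 in combinations(classes, 2):
        winner = c1 if classifiers[(c1, c2)](x) > 0 else c2
        votes[winner] += 1
    return votes.most_common(1)[0][0]

# toy example with three role labels and stub pairwise classifiers
classes = ["A0", "A1", "AM"]
classifiers = {
    ("A0", "A1"): lambda x: -1.0,  # votes A1
    ("A0", "AM"): lambda x: +1.0,  # votes A0
    ("A1", "AM"): lambda x: +1.0,  # votes A1
}
```

Each pairwise problem involves only the training examples of its two classes, which keeps the per-classifier Hessian small, in contrast to a single multiclass model.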
  </Section>
  <Section position="5" start_page="177" end_page="179" type="metho">
    <SectionTitle>
3 System Description
</SectionTitle>
    <Paragraph position="0"> Like previous systems for semantic role identification and classification, we used an approach based on classification of nodes in the constituent tree.</Paragraph>
    <Paragraph position="1"> To simplify training, we used the soft-prune approach as described in (Pradhan et al., 2005), which means that before classification, the nodes were filtered through a binary classifier that classifies them as having a semantic role or not (NON-NULL or NULL). The NULL nodes missed by the filter were included in the training set for the final classifier.</Paragraph>
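The soft-prune pipeline can be sketched as follows; the node identifiers, the stub NULL filter, and the gold-role lookup are hypothetical placeholders:

```python
def soft_prune(nodes, null_filter):
    """Pass every parse-tree node through a binary NULL / NON-NULL
    filter; only the survivors go on to role classification."""
    return [n for n in nodes if null_filter(n) == "NON-NULL"]

def build_role_training_set(nodes, null_filter, gold_role):
    """NULL nodes that slip through the filter are kept in the role
    classifier's training data (labeled NULL), so it learns to
    reject the filter's false positives itself."""
    survivors = soft_prune(nodes, null_filter)
    return [(n, gold_role.get(n, "NULL")) for n in survivors]

# toy run: 'np2' has no role but passes the filter (a false positive)
nodes = ["np1", "vp1", "np2"]
null_filter = lambda n: "NULL" if n == "vp1" else "NON-NULL"
gold_role = {"np1": "A0"}
train = build_role_training_set(nodes, null_filter, gold_role)
```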
    <Paragraph position="2"> Since our current implementation of the RVM training algorithm does not scale up to large training sets, training on the whole PropBank was infeasible.</Paragraph>
    <Paragraph position="3"> We instead trained the multiclass classifier on sections 15-18, and used an SVM for the soft-pruning classifier, which was then trained on the remaining sections. The excellent LIBSVM (Chang and Lin, 2001) package was used to train the SVM.</Paragraph>
    <Paragraph position="4"> The features used by the classifiers can be grouped into predicate and node features. Of the node features, we here pay most attention to the parse tree path features.</Paragraph>
    <Section position="1" start_page="177" end_page="178" type="sub_section">
      <SectionTitle>
3.1 Predicate Features
</SectionTitle>
      <Paragraph position="0"> We used the following predicate features, all of which first appeared in (Gildea and Jurafsky, 2002).</Paragraph>
      <Paragraph position="1">  work, we used the head rules of Collins to extract this feature.</Paragraph>
      <Paragraph position="2"> * Position. A binary feature that describes whether the node is before or after the predicate token.
* Phrase type (PT), that is, the label of the constituent.
* Named entity. Type of the first contained NE.
* Governing category. As in (Gildea and Jurafsky, 2002), this was used to distinguish subjects from objects. For an NP, this is either S or VP.
* Path features. (See next subsection.)
For prepositional phrases, we attached the preposition to the PT and replaced the head word and head POS with those of the first contained NP.</Paragraph>
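The feature assembly above can be sketched as follows; the record layout of the node and predicate, and all field names, are hypothetical, chosen only to illustrate the listed features and the PP special case:

```python
def extract_features(node, predicate):
    """Assemble the flat feature dict described above from hypothetical
    node/predicate records (field names are illustrative)."""
    pt = node["phrase_type"]
    head, head_pos = node["head_word"], node["head_pos"]
    # PP special case: attach the preposition to the PT and take the
    # head word / head POS from the first contained NP instead
    if pt == "PP":
        pt = pt + "-" + node["preposition"]
        head, head_pos = node["np_head_word"], node["np_head_pos"]
    return {
        "position": "before" if node["start"] < predicate["index"] else "after",
        "phrase_type": pt,
        "head_word": head,
        "head_pos": head_pos,
        "governing_category": node.get("gov_cat"),  # S or VP for an NP
        "named_entity": node.get("first_ne"),       # first contained NE type
    }

# toy PP node, e.g. "in the park" after the predicate
node = {"phrase_type": "PP", "head_word": "in", "head_pos": "IN",
        "preposition": "in", "np_head_word": "park", "np_head_pos": "NN",
        "start": 5, "gov_cat": None, "first_ne": None}
feats = extract_features(node, {"index": 2})
```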
    </Section>
    <Section position="2" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
3.3 Parse Tree Path Features
</SectionTitle>
      <Paragraph position="0"> Previous studies have shown that the parse tree path feature, used by almost all systems since (Gildea and Jurafsky, 2002), is salient for argument identification. However, it is extremely sparse (which makes the system learn slowly) and is dependent on the quality of the parse tree. We therefore investigated the contribution of the following features in order to come up with a combination of path features that leads to a robust system that generalizes well.</Paragraph>
      <Paragraph position="1"> * Constituent tree path. As in (Gildea and Jurafsky, 2002), this feature represents the path (consisting of step directions and PTs of the nodes traversed) from the node to the predicate, for example NP|VP|VB for a typical object.</Paragraph>
      <Paragraph position="2"> Removing the direction (as in (Pradhan et al., 2005)) improved neither precision nor recall.</Paragraph>
      <Paragraph position="3"> * Partial path. To reduce sparsity, we introduced a partial path feature (as in (Pradhan et al., 2005)), which consists of the path from the node to the lowest common ancestor.</Paragraph>
      <Paragraph position="4"> * Dependency tree path. We believe that labeled dependency paths provide more information about grammatical functions (and, implicitly, semantic relationships) than the raw constituent structure. Since the grammatical functions are not directly available from the parse trees, we investigated two approximations of dependency arc labels: first, the POSs of the head tokens; secondly, the PTs of the head node and its immediate parent (such labels were used in (Ahn et al., 2004)).</Paragraph>
      <Paragraph position="5"> * Shallow path. Since the UPC shallow parsers were expected to be more robust than the full parsers, we used a shallow path feature. We first built a parse tree using clause and chunk bracketing, and the shallow path feature was then constructed like the constituent tree path.</Paragraph>
      <Paragraph position="6"> * Subpaths. All subpaths of the constituent path.</Paragraph>
      <Paragraph position="7"> We used the parse trees from Charniak's parser to derive all paths except for the shallow path.</Paragraph>
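The constituent and partial path features can be sketched as below; the dictionary-based tree encoding (a child-to-parent map plus node labels) is a simplifying assumption, and step directions are omitted:

```python
def ancestors(node, parent):
    """Chain of nodes from `node` up to the root, following `parent`."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def constituent_path(node, pred, parent, label):
    """Labels on the path from the node up to the lowest common ancestor
    and down to the predicate, e.g. NP|VP|VB for a typical object."""
    up, down = ancestors(node, parent), ancestors(pred, parent)
    common = next(n for n in up if n in down)
    up_part = up[:up.index(common) + 1]
    down_part = list(reversed(down[:down.index(common)]))
    return "|".join(label[n] for n in up_part + down_part)

def partial_path(node, pred, parent, label):
    """Only the upward half: node to lowest common ancestor."""
    up, down = ancestors(node, parent), ancestors(pred, parent)
    common = next(n for n in up if n in down)
    return "|".join(label[n] for n in up[:up.index(common) + 1])

# toy tree: (S (NP ...) (VP (VB ...) (NP2 ...)))
parent = {"NP": "S", "VP": "S", "VB": "VP", "NP2": "VP"}
label = {"S": "S", "NP": "NP", "VP": "VP", "VB": "VB", "NP2": "NP"}
```

On this toy tree, the path from the object NP to the verb reproduces the NP|VP|VB example from the text, while the partial path stops at the VP.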
    </Section>
    <Section position="3" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
4 Results
4.1 Comparison with SVM
</SectionTitle>
      <Paragraph position="0"> The binary classifiers that comprise the one-versus-one multiclass classifier were 89-98% smaller when using RVM compared to SVM. However, the performance dropped by about 2 percent. A possible reason for the drop is that the classifier uses a number of features with extremely sparse distributions (two word features and three path features).</Paragraph>
    </Section>
    <Section position="4" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
4.2 Path Feature Contributions
</SectionTitle>
      <Paragraph position="0"> To estimate the contribution of each path feature, we measured the difference in performance between a system that used all six features and one where one of the features had been removed. Table 2 shows the results for each of the six features. For the final system, we used the dependency tree path with PT pairs, the shallow path, and the partial path.</Paragraph>
    </Section>
    <Section position="5" start_page="178" end_page="179" type="sub_section">
      <SectionTitle>
4.3 Final System Results
</SectionTitle>
      <Paragraph position="0"> The results of the complete system on the test sets are shown in Table 1. The smaller training set (as mentioned above, we used only sections 15-18</Paragraph>
      <Paragraph position="1"> for the role classifier) causes the result to be significantly lower than the state of the art (F-measure of 79.4, reported in (Pradhan et al., 2005)).</Paragraph>
    </Section>
  </Section>
</Paper>