<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2408">
  <Title>Modeling Category Structures with a Kernel Function</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Kernels from Probabilistic Models
</SectionTitle>
    <Paragraph position="0"> Recently new type of kernels which connect generative models of data and discriminative classifiers such as SVMs, have been proposed: the Fisher kernel (Jaakkola and Haussler, 1998) and the TOP (Tangent vector Of the Posterior log-odds) kernel (Tsuda et al., 2002).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Fisher Kernel
</SectionTitle>
      <Paragraph position="0"> Suppose we have a probabilistic generative model p(xj ) of the data (we denote an example by x). The Fisher score of x is defined as r logp(xj ), where r means partial differentiation with respect to the parameters . The Fisher information matrix is denoted by I( ) (this matrix defines the geometric structure of the model space).</Paragraph>
      <Paragraph position="1"> Then, the Fisher kernel at an estimate ^ is given by:</Paragraph>
      <Paragraph position="3"> The Fisher score of an example approximately indicates how the model will change if the example is added to the training data used in the estimation of the model. That means, the Fisher kernel between two examples will be large, if the influences of the two examples to the model are similar and large (Tsuda and Kawanabe, 2002).</Paragraph>
      <Paragraph position="4"> The matrix I( ) is often approximated by the identity matrix to avoid large computational overhead.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 TOP Kernel
</SectionTitle>
      <Paragraph position="0"> On the basis of a probabilistic model of the data, TOP kernels are designed to extract feature vectors f^ which are considered to be useful for categorization with a separating hyperplane.</Paragraph>
      <Paragraph position="1"> We begin with the proposition that, between the generalization error R(f^ ) and the expected error of the posterior probability D(f^ ), the relation R(f^ )!L/ * 2D(f^ ) holds, where L/ is the Bayes error. This inequality means that minimizing D(f^ ) leads to reducing the generalization error R(f^ ). D(f^ ) is expressed, using a logistic</Paragraph>
      <Paragraph position="3"> where / denotes the actual parameters of the model.</Paragraph>
      <Paragraph position="4"> The TOP kernel consists of features which can minimize D(f^ ). In other words, we would like to have feature vectors f^ that satisfy the following: 8x; w C/ f^ (x)!b = F!1(P(y = +1jx; /)): (5) for certain values of w and b.</Paragraph>
      <Paragraph position="5"> For that purpose, we first define a function v(x; ):</Paragraph>
      <Paragraph position="7"> then (5) is approximately satisfied. Thus, the TOP kernel is defined as</Paragraph>
      <Paragraph position="9"> A detailed discussion of the TOP kernel and its theoretical analysis have been given by Tsuda et al (Tsuda et al., 2002).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Related Work
</SectionTitle>
    <Paragraph position="0"> Hofmann (2000) applied Fisher kernels to text categorization under the Probabilistic Latent Semantic Indexing (PLSI) model (Hofmann, 1999).</Paragraph>
    <Paragraph position="1"> In PLSI, the joint probability of document d and word</Paragraph>
    <Paragraph position="3"> where variables zk correspond to latent classes. After the estimation of the model using the EM algorithm, the Fisher kernel for this model is computed. The average log-likelihood of document d normalized by the document length is given by</Paragraph>
    <Paragraph position="5"> They use spherical parameterization (Kass and Vos, 1997) instead of the original parameters in the model.</Paragraph>
    <Paragraph position="6"> They define parameters %0jk = 2pP(wjjzk) and %0k = 2pP(zk), and obtained</Paragraph>
    <Paragraph position="8"> Thus, the Fisher kernel for this model is obtained as described in Appendix A.</Paragraph>
    <Paragraph position="9"> The first term of (31) corresponds to the similarity through latent spaces. The second term corresponds to the similarity through the distribution of each word. The number of latent classes zk can affect the value of the kernel function. In the experiment of (Hofmann, 2000), they computed the kernels with the different numbers (1 to 64) of zk and added them together to make a robust kernel instead of deciding one specific number of latent classes zk.</Paragraph>
    <Paragraph position="10"> They concluded that the Fisher kernel based on PLSI is effective when a large amount of unlabeled examples are available for the estimation of the PLSI model.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Hyperplane-based TOP Kernel
</SectionTitle>
    <Paragraph position="0"> In this section, we explain our TOP kernel.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Derivation of HP-TOP kernel
</SectionTitle>
      <Paragraph position="0"> Suppose we have obtained the parameters wc and bc of the separating hyperplane for each category c 2 Ccategory in the original feature space, where Ccategory denotes the set of categories.</Paragraph>
      <Paragraph position="1"> We assume that the class-posteriors Pc(+1jd) and</Paragraph>
      <Paragraph position="3"> where, for any category x, component function q(djx) is of Gaussian-type:</Paragraph>
      <Paragraph position="5"> with the mean ,,x of a random variable wx C/ d ! bx and the variance x. Those parameters are estimated with the maximum likelihood estimation, as follows:</Paragraph>
      <Paragraph position="7"> We choose the Gaussian-type function as an example.However, this choice is open to argument, since some other models also have the same computational advantage as described in Section 5.4.</Paragraph>
      <Paragraph position="8"> We set x1 = ,,x= 2x, x2 = !1=2 2x. Although x1 and x2 are not the natural parameters of this model, 1We cannot say q(djx) is a generative probability of d given class x, because it is one-dimensional and not valid as a probability density in the original feature space.</Paragraph>
      <Paragraph position="9"> we parameterize this model using the parameters x1, x2, wx, bx and P(x) (8x 2 Ccategory) for simplicity. Using this probabilistic model,we compute function v(d; ) as described in Appendix B ( denotes fwx;bx; x1; x2jx 2 Ccategoryg and wxi denotes the i-th element of the weight vector wx).</Paragraph>
      <Paragraph position="10"> The partial derivatives of this function with respect to the parameters are in Appendix C.</Paragraph>
      <Paragraph position="11"> Then we can follow the definition (10) to obtain our version of the TOP kernel. We call this new kernel a hyperplane-based TOP (HP-TOP) kernel.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Properties of HP-TOP kernel
</SectionTitle>
      <Paragraph position="0"> In the derivatives (39), which provide the largest number of features, original features di are accompanied by other factors computed from probability distributions. This form suggests that two vectors are considered to be more similar, if they have similar distributions over categories.</Paragraph>
      <Paragraph position="1"> In other words, an occurrence of a word can have different contribution to the classification result, depending on the context (i.e., the other words in the document).</Paragraph>
      <Paragraph position="2"> This property of the HP-TOP kernel can lead to the effect of word sense disambiguation, because &amp;quot;bank&amp;quot; in a financial document is treated differently from &amp;quot;bank&amp;quot; in a document related to a river-side park.</Paragraph>
      <Paragraph position="3"> The derivatives (34) and (35) correspond to the first-order differences, respectively for the positive class and the negative class. Similarly, the derivatives (36) and (37) for the second-order differences. The derivatives (40) and (41) are for the first-order differences normalized by the variances.</Paragraph>
      <Paragraph position="4"> The derivatives other than (38) and (38) directly depend on the distance from a hyperplane, rather than on the value of each feature. These derivatives enrich the feature set, when there are few active words, by which we mean the words that do not occur in the training data.</Paragraph>
      <Paragraph position="5"> For this reason, we expect that the HP-TOP kernel works well for a small training dataset.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Computational issue
</SectionTitle>
      <Paragraph position="0"> Computing the kernel in this form is time-consuming, because the number of components of type (39) can be very large:</Paragraph>
      <Paragraph position="2"> where I denotes the set of indices for original features.</Paragraph>
      <Paragraph position="3"> However, we can avoid this heavy computational cost as follows. Let us compute the dot-product of derivatives (39) of two vectors d1 and d2, which is shown in Appendix D. The last expression (45) is regarded as the scalar product of two dot-products. Thus, by preserving vectors d and</Paragraph>
      <Paragraph position="5"> we can efficiently compute the dot-product in (39); the computational complexity of a kernel function is O(jIj); (23) on the condition that the original dimension is larger than the number of categories. Thus, from the viewpoint of computational time, our kernel has an advantage over some other kernels such as the PLSI-based Fisher kernel in Section 4, which requires the computational complexity of O(jIjPSjCclusterj), where Ccluster denotes the set of clusters.</Paragraph>
      <Paragraph position="6"> In the PLSI-based Fisher kernel, each word has a probability distribution over latent classes. In this sense, the PLSI-based Fisher kernel is more detailed, but detailed models are sometimes suffer overfitting to the training data and have the computational disadvantage as mentioned above.</Paragraph>
      <Paragraph position="7"> The PLSI-based Fisher kernel can be extended to a TOP kernel by using given categories as latent classes. However, the problem of computational time still remains. null</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 General statement about the computational
advantage
</SectionTitle>
      <Paragraph position="0"> So far, we have discussed the computational time for the kernel constructed on the Gaussian mixture. However, the computational advantage of the kernel, in fact, is shared by a more general class of models.</Paragraph>
      <Paragraph position="1"> We examine the required conditions for the computational advantage. Suppose the class-posteriors have the mixture form as Equations (16) and (17), but function q(djx) does not have to be a Gaussian-type function. Instead, function q(djx) is supposed to be represented using some function r parametrized by we and b, as:</Paragraph>
      <Paragraph position="3"> The first two factors of (25) do not depend on i. Therefore, if the last factor of (25) is variable-separable with respect to e and i:</Paragraph>
      <Paragraph position="5"> where S and T are some function, then the derivative (25) is also variable-separable. In such cases, the efficient computation described in Section 5.3 is possible by preserving the vectors:</Paragraph>
      <Paragraph position="7"> We have now obtained the required conditions for the efficient computation: Equation (24) and the variableseparability. null In case of Gaussian-type functions, function fe and its derivative with respect to wei are</Paragraph>
      <Paragraph position="9"> Thus, the conditions are satisfied.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>