<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1190">
  <Title>Semi-Supervised Training of a Kernel PCA-Based Model for Word Sense Disambiguation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Related work
</SectionTitle>
    <Paragraph position="0"> The long history of WSD research includes numerous statistically trained methods; space only permits us to summarize a few key points here. Na&amp;quot;ive Bayes models (e.g., Mooney (1996), Chodorow et al. (1999), Pedersen (2001), Yarowsky and Florian (2002)) as well as maximum entropy models (e.g., Dang and Palmer (2002), Klein and Manning (2002)) in particular have shown a large degree of success for WSD, and have established challenging state-of-the-art benchmarks. The Senseval series of evaluations facilitates comparing the strengths and weaknesses of various WSD models on common data sets, with Senseval-1 (Kilgarriff and Rosenzweig, 1999), Senseval-2 (Kilgarriff, 2001), and Senseval-3 held in 1998, 2001, and 2004 respectively.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Supervised KPCA baseline model
</SectionTitle>
    <Paragraph position="0"> Our baseline WSD model is a supervised learning model that also makes use of Kernel Principal Component Analysis (KPCA), proposed by (Sch&amp;quot;olkopf et al., 1998) as a generalization of PCA. KPCA has been successfully applied in many areas such as de-noising of images of hand-written digits (Mika et al., 1999) and modeling the distribution of non-linear data sets in the context of shape modelling for real objects (Active Shape Models) (Twining and Taylor, 2001). In this section, we first review the theory of KPCA and explanation of why it is suited for WSD applications.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Kernel Principal Component Analysis
</SectionTitle>
      <Paragraph position="0"> The Kernel Principal Component Analysis technique, or KPCA, is a method of nonlinear principal component extraction. A nonlinear function maps the n-dimensional input vectors from their original space R^n to a high-dimensional feature space F, where linear PCA is performed. In real applications, the nonlinear function is usually not explicitly provided. Instead, we use a kernel function to implicitly define the nonlinear mapping; in this respect KPCA is similar to Support Vector Machines (Schölkopf et al., 1998).</Paragraph>
      <Paragraph position="1"> Compared with other common analysis techniques, KPCA has several advantages: * As with other kernel methods it inherently takes combinations of predictive features into account when optimizing dimensionality reduction. For natural language problems in general, of course, it is widely recognized that significant accuracy gains can often be achieved by generalizing over relevant feature combinations (e.g., Kudo and Matsumoto (2003)).</Paragraph>
      <Paragraph position="2"> * We can select suitable kernel function according to the task we are dealing with and the knowledge we have about the task.</Paragraph>
      <Paragraph position="3"> * Another advantage of KPCA is that it is good at dealing with input data with very high dimensionality, a condition where kernel methods excel. Nonlinear principal components (Diamantaras and Kung, 1996) may be defined as follows. Suppose we are given a training set of M pairs (xt,ct) where the observed vectors xt [?] Rn in an n-dimensional input space X represent the context of the target word being disambiguated, and the correct class ct represents the sense of the word, for t = 1,..,M. Suppose Ph is a nonlinear mapping from the input space Rn to the feature space F. Without loss of generality we assume the M vectors are centered vectors in the feature space, i.e.,summationtext</Paragraph>
      <Paragraph position="5"> wish to diagonalize the covariance matrix in F:</Paragraph>
      <Paragraph position="7"> To do this requires solving the equation lv = Cv for eigenvalues l [?] 0 and eigenvectors v [?] F. Because</Paragraph>
      <Paragraph position="9"> and let ^l1 [?] ^l2 [?] ... [?] ^lM denote the eigenvalues of ^K and ^a1 ,..., ^aM denote the corresponding complete set of normalized eigenvectors, such that ^lt(^at * ^at) = 1  when ^lt &gt; 0. Then the lth nonlinear principal component of any test vector xt is defined as</Paragraph>
      <Paragraph position="11"> where ^ali is the lth element of ^al .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Why is KPCA suited to WSD?
</SectionTitle>
      <Paragraph position="0"> The potential of nonlinear principal components for WSD can be illustrated by a simplified disambiguation example for the ambiguous target word &amp;quot;art&amp;quot;, with the two senses shown in Table 1. Assume a training corpus of the eight sentences as shown in Table 2, adapted from Senseval-2 English lexical sample corpus. For each sentence, we show the feature set associated with that occurrence of &amp;quot;art&amp;quot; and the correct sense class. These eight occurrences of &amp;quot;art&amp;quot; can be transformed to a binary vector representation containing one dimension for each feature, as shown in Table 3.</Paragraph>
      <Paragraph position="1"> Extracting nonlinear principal components for the vectors in this simple corpus results in nonlinear generalization, reflecting an implicit consideration of combinations of features. Table 2 shows the first three dimensions of the principal component vectors obtained by transforming each of the eight training vectors xt into (a) principal component vectors zt using the linear transform obtained via PCA, and (b) nonlinear principal component vectors yt using the nonlinear transform obtained via KPCA as described below.</Paragraph>
      <Paragraph position="2">  2001), together with a tiny example set of features. The training and testing examples can be represented as a set of binary vectors: each row shows the correct class c for an observed vector x of five dimensions. TRAINING design/N media/N the/DT entertainment/N world/N Class  tivated in some continental schools that began to affect England soon after the Norman Conquest were those of measurement and calculation. null  sign arts particularly, this led to appointments made for political rather than academic reasons.</Paragraph>
      <Paragraph position="4"> components as transformed via PCA and KPCA.</Paragraph>
      <Paragraph position="5"> Observed vectors PCA-transformed vectors KPCA-transformed vectors Class</Paragraph>
      <Paragraph position="7"> Similarly, for the test vector x9, Table 3 shows the first three dimensions of the principal component vectors obtained by transforming it into (a) a principal component vector z9 using the linear PCA transform obtained from training, and (b) a nonlinear principal component vector y9 using the nonlinear KPCA transform obtained obtained from training. The vector similarities in the KPCA-transformed space can be quite different from those in the PCA-transformed space. This causes the KPCA-based model to be able to make the correct</Paragraph>
      <Paragraph position="9"> class prediction, whereas the PCA-based model makes the wrong class prediction.</Paragraph>
      <Paragraph position="10"> What permits KPCA to apply stronger generalization biases is its implicit consideration of combinations of feature information in the data distribution from the high-dimensional training vectors. In this simplified illustrative example, there are just five input dimensions; the effect is stronger in more realistic high dimensional vector spaces. Since the KPCA transform is computed from unsupervised training vector data, and extracts generalizations that are subsequently utilized during supervised classification, it is possible to combine large amounts of unsupervised data with reasonable smaller amounts of supervised data.</Paragraph>
      <Paragraph position="11"> Interpreting this example graphically can be illuminating even though the interpretation in three dimensions is severely limiting. Figure 1(a) depicts the eight original observed training vectors xt in the first three of the five dimensions; note that among these eight vectors, there happen to be only four unique points when restricting our view to these three dimensions. Ordinary linear PCA can be straightforwardly seen as projecting the original points onto the principal axis, as can be seen for the case of the first principal axis in Figure 1(b). Note that in this space, the sense 2 instances are surrounded by sense 1 instances. We can traverse each of the projections onto the principal axis in linear order, simply by visiting each of the first principal components z1t along the principle axis in order of their values, i.e., such that</Paragraph>
      <Paragraph position="13"> It is significantly more difficult to visualize the non-linear principal components case, however. Note that in general, there may not exist any principal axis in X, since an inverse mapping from F may not exist. If we attempt to follow the same procedure to traverse each of the projections onto the first principal axis as in the case of linear PCA, by considering each of the first principal components y1t in order of their value, i.e., such that</Paragraph>
      <Paragraph position="15"> then we must arbitrarily select a &amp;quot;quasi-projection&amp;quot; direction for each y1t since there is no actual principal axis toward which to project. This results in a &amp;quot;quasi-axis&amp;quot; roughly as shown in Figure 1(c) which, though not precisely accurate, provides some idea as to how the non-linear generalization capability allows the data points to be grouped by principal components reflecting nonlinear patterns in the data distribution, in ways that linear  PCA cannot do. Note that in this space, the sense 1 instances are already better separated from sense 2 data points. Moreover, unlike linear PCA, there may be up to M of the &amp;quot;quasi-axes&amp;quot;, which may number far more than five. Such effects can become pronounced in the high dimensional spaces are actually used for real word sense disambiguation tasks.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Algorithm
</SectionTitle>
      <Paragraph position="0"> To extract nonlinear principal components efficiently, note that in both Equations (5) and (6) the explicit form of Ph(xi) is required only in the form of (Ph(xi)*Ph(xj)), i.e., the dot product of vectors in F. This means that we can calculate the nonlinear principal components by substituting a kernel function k(xi,xj) for (Ph(xi)*Ph(xj )) in Equations (5) and (6) without knowing the mapping Ph explicitly; instead, the mapping Ph is implicitly defined by the kernel function. It is always possible to construct a mapping into a space where k acts as a dot product so long as k is a continuous kernel of a positive integral operator (Sch&amp;quot;olkopf et al., 1998).</Paragraph>
      <Paragraph position="1"> Thus we train the KPCA model using the following algorithm:  1. Compute an M xM matrix ^K such that</Paragraph>
      <Paragraph position="3"> 2. Compute the eigenvalues and eigenvectors of matrix  ^K and normalize the eigenvectors. Let ^l1 [?] ^l2 [?] ... [?] ^lM denote the eigenvalues and ^a1,..., ^aM denote the corresponding complete set of normalized eigenvectors.</Paragraph>
      <Paragraph position="4"> To obtain the sense predictions for test instances, we need only transform the corresponding vectors using the trained KPCA model and classify the resultant vectors using nearest neighbors. For a given test instance vector x, its lth nonlinear principal component is</Paragraph>
      <Paragraph position="6"> where ^ali is the ith element of ^al.</Paragraph>
      <Paragraph position="7"> For our disambiguation experiments we employ a polynomial kernel function of the form k(xi,xj) = (xi *xj)d, although other kernel functions such as gaussians could be used as well. Note that the degenerate case of d = 1 yields the dot product kernel k(xi,xj) = (xi*xj) which covers linear PCA as a special case, which may explain why KPCA always outperforms PCA.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Semi-supervised KPCA model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Utilizing unlabeled data
</SectionTitle>
      <Paragraph position="0"> In WSD, as with many NLP tasks, features are often interdependent. For example, the features that represent words that frequently co-occur are typically highly interdependent. Similarly, the features that represent synonyms tend to be highly interdependent.</Paragraph>
      <Paragraph position="1"> It is a strength of the KPCA-based model that it generalizes over combinations of interdependent features.</Paragraph>
      <Paragraph position="2"> This enables the model to predict the correct sense even when the context surrounding a target word has not been previously seen, by exploiting the similarity to feature combinations that have been seen.</Paragraph>
      <Paragraph position="3"> However, in practice the labeled training corpus for WSD is typically relatively small, and does not yield enough training instances to reliably extract dependencies between features. For example, in the Senseval-2 English lexical sample data, for each target word there are only about 120 training instances on average, whereas on the other hand we typically have thousands of features for each target word.</Paragraph>
      <Paragraph position="4"> The KPCA model can fail when it encounters a target word whose context contains a combination of features that may in fact be interdependent, but are not similar to any combinations that occurred in the limited amounts of labeled training data. Because of the sparse data, the KPCA model wrongly considers the context of the target word to be dissimilar to those previously seen--even though the contexts may in truth be similar. In the absence of any contexts it believes to be similar, the model therefore tends simply to predict the most frequent sense.</Paragraph>
      <Paragraph position="5"> The potential solution we propose to this problem is to add much larger quantities of unannotated data, with which the KPCA model can first be trained in unsupervised fashion. This provides a significantly broader dataset from which to generalize over combinations of dependent features. One of the advantages of our WSD model is that during KPCA training, the sense class is not taken into consideration. Thus we can take advantage of the vast amounts of cheap unannotated corpora, in addition to the relatively small amounts of labeled training data. Adding a large quantity of unlabeled data makes it much likelier that dependent features can be identified during KPCA training.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Algorithm
</SectionTitle>
      <Paragraph position="0"> The primary difference of the semi-supervised KPCA model from the supervised KPCA baseline model described above lies in the eigenvector calculation step. As we mentioned earlier, in KPCA-based model, we need to calculate the eigenvectors of matrix K, where Kij = (Ph(xi)*Ph(xj )). In the supervised KPCA model, training vectors such as xi and xj are only drawn from the labeled training corpus. In the semi-supervised KPCA model, training vectors are drawn from both the labeled training corpus and a much larger unlabeled training corpus. As a consequence, the maximum number of eigenvectors in the supervised KPCA model is the minimum of the number of features and the number of vectors from the labeled training corpus, while the maximum number of eigenvectors for the semi-supervised KPCA model is the minimum of the number of features and total number of vectors from the combined labeled and unlabeled training corpora.</Paragraph>
      <Paragraph position="1"> However, one would not want to apply the semi-supervised KPCA model indiscriminately. While it can be expected to be valuable in cases where the data was too sparse for reliable training of the supervised KPCA model, at the same time it is important to note that the unlabeled data is typically drawn from quite different distributions than the labeled data, and may therefore be expected to introduce a new source of noise.</Paragraph>
      <Paragraph position="2"> We therefore define a composite semi-supervised KPCA model based on the following assumption. If we are sufficiently confident about the prediction made by the supervised KPCA model as to the predicted sense for the target word, we need not resort to the semi-supervised KPCA method. On the other hand, if we are not confident about the supervised KPCA model's prediction, we then turn to the semi-supervised KPCA model and take its classification as the predicted sense.</Paragraph>
      <Paragraph position="3"> Specifically, the composite model uses the following algorithm to combine the sense predictions of the supervised and semi-supervised KPCA models in order to dis- null ambiguate the target word in a given test instance x: 1. let s1 be the predicted sense of x using the supervised KPCA baseline model 2. let c be the similarity between x and its most similar training instance 3. if c [?] t or s1 negationslash= smf (where t is a preset thresh null old, and smf is the most frequent sense of the target word): * then predict the sense of the target word of x to be s1 * else predict the sense of the target word of x to be s2, the sense predicted by the semi-supervised KPCA model The two conditions checked in step 3 serve to filter those instances where the supervised KPCA baseline model is confident enough to skip the semi-supervised KPCA model. In particular: * The threshold t specifies a minimum level of the supervised KPCA baseline model's confidence, in terms of similarity. If c [?] t, then there were training instances that were of sufficient similarity to the test instance so that the model can be confident that a correct disambiguation can be predicted based only on those similar training instances. In this case the semi-supervised KPCA model is not needed.</Paragraph>
      <Paragraph position="4"> * If s1 is not the most frequent sense smf of the target word, then there is strong evidence that the test instance should be disambiguated as s1 because this is overriding an otherwise strong tendency to disambiguate the target word to the most frequent sense. Again, in this case the semi-supervised KPCA model should be avoided.</Paragraph>
      <Paragraph position="5"> The threshold t is defined to rise as the relative frequency of the most frequent sense falls. Specifically,</Paragraph>
      <Paragraph position="7"> most frequent sense in the training corpus and c is a small constant. This reflects the assumption that the higher the probability of the most frequent sense, the less likely that a test instance disambiguated as the most frequent sense is wrong.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experimental setup
</SectionTitle>
    <Paragraph position="0"> We evaluated the composite semi-supervised KPCA model using data from the Senseval-2 English lexical sample task (Kilgarriff, 2001)(Palmer et al., 2001). We chose to focus on verbs, which have proven particularly difficult to disambiguate. Our task consists in disambiguating several instances of 16 different target verbs.  supervised na&amp;quot;ive Bayes and maximum entropy models, as well as the most-frequent-sense and supervised KPCA baseline models.</Paragraph>
    <Paragraph position="1">  For each target word, training and test instances manually tagged with WordNet senses are available. There are an average of about 10.5 senses per target word, ranging from 4 to 19. All our models are evaluated on the Senseval-2 test data, but trained on different training sets. We report accuracy, the number of correct predictions over the total number of test instances, at two different levels of sense granularity.</Paragraph>
    <Paragraph position="2"> The supervised models are trained on the Senseval-2 training data. On average, 137 annotated training instances per target word are available.</Paragraph>
    <Paragraph position="3"> In addition to the small annotated Senseval-2 data set, the semi-supervised KPCA model can make use of large amounts of unannotated data. Since most of the Senseval-2 verb data comes from the Wall Street Journal, we choose to augment the Senseval-2 data by collecting additional training instances from the Wall Street Journal Tipster corpus. In order to minimize the noise during KPCA learning, we only extract the sentences in which the target word occurs. For each target word, up to 1500 additional training instances were extracted. The resulting training corpus for the semi-supervised KPCA model is more than 10 times larger than the Senseval-2 training set, with an average of 1637 training instances per target word.</Paragraph>
    <Paragraph position="4"> The set of features used is as described by Yarowsky and Florian (2002) in their &amp;quot;feature-enhanced na&amp;quot;ive Bayes model&amp;quot;, with position-sensitive, syntactic, and local collocational features.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML