<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2608">
  <Title>Syntagmatic Kernels: a Word Sense Disambiguation Case Study</Title>
  <Section position="3" start_page="57" end_page="58" type="metho">
    <SectionTitle>
2 Sequence Kernels
</SectionTitle>
    <Paragraph position="0"> The basic idea behind kernel methods is to embed the data into a suitable feature space F via a mapping function ph : X - F, and then use a linear algorithm for discovering nonlinear patterns. Instead of using the explicit mapping ph, we can use a kernel function K : X xX - R, that corresponds to the inner product in a feature space which is, in general, different from the input space.</Paragraph>
    <Paragraph position="1"> Kernel methods allow us to build a modular system, as the kernel function acts as an interface between the data and the learning algorithm. Thus the kernel function becomes the only domain specific module of the system, while the learning algorithm is a general purpose component. Potentially any kernel function can work with any kernel-based algorithm. In our system weuse Support Vector Machines (Cristianini and Shawe-Taylor, 2000).</Paragraph>
    <Paragraph position="2"> Sequence Kernels (or String Kernels) are a family of kernel functions developed to compute the inner product among images of strings in high-dimensional feature space using dynamic programming techniques (Shawe-Taylor and Cristianini, 2004). The Gap-Weighted Subsequences Kernel is the most general Sequence Kernel. Roughly speaking, it compares two strings by means of the number of contiguous and non-contiguous substrings of a given length they have in common. Non contiguous occurrences are penalized according to the number of gaps they contain.</Paragraph>
    <Paragraph position="3">  Formally, let S be an alphabet of |S |symbols, and s = s1s2 ...s|s |a finite sequence over S (i.e.</Paragraph>
    <Paragraph position="4"> si [?] S,1 lessorequalslant i lessorequalslant |s|). Let i = [i1,i2,...,in], with 1 lessorequalslant i1 &lt; i2 &lt; ... &lt; in lessorequalslant |s|, be a subset of the indices in s: we will denote as s[i] [?] Sn the sub-sequence si1si2 ...sin. Note that s[i] does not necessarily form a contiguous subsequence of s. For example, if s is the sequence &amp;quot;Ronaldo scored the goal&amp;quot; and i = [2,4], then s[i] is &amp;quot;scored goal&amp;quot;. The length spanned by s[i] in s is l(i) = in [?] i1 + 1.</Paragraph>
    <Paragraph position="5"> The feature space associated with the Gap-Weighted Subsequences Kernel of length n is indexed by I = Sn, with the embedding given by</Paragraph>
    <Paragraph position="7"> where l [?]]0,1] is the decay factor used to penalize non-contiguous subsequences1. The associate kernel is defined as</Paragraph>
    <Paragraph position="9"> An explicit computation of Equation 2 is unfeasible even for small values of n. To evaluate more efficiently Kn, weusetherecursive formulation proposed in (Lodhi et al., 2002; Saunders et al., 2002; Cancedda et al., 2003) based on a dynamic programming implementation. It is reported in the following equations:</Paragraph>
    <Paragraph position="11"> Kprimen and Kprimeprimen are auxiliary functions with a similar definition as Kn used to facilitate the computation. Based on all definitions above, Kn can be 1Notice that by choosing l = 1 sparse subsequences are not penalized. On the other hand, the kernel does not take into account sparse subsequences with l - 0.</Paragraph>
    <Paragraph position="12"> computed in O(n|s||t|). Using the above recursive definition, it turns out that computing all kernel values for subsequences of lengths up to n is not significantly more costly than computing the kernel for n only.</Paragraph>
    <Paragraph position="13"> In the rest of the paper we will use the normalised version of the kernel (Equation 10) to keep the values comparable for different values of n and to be independent from the length of the sequences.</Paragraph>
    <Paragraph position="15"/>
  </Section>
  <Section position="4" start_page="58" end_page="59" type="metho">
    <SectionTitle>
3 The Syntagmatic Kernel
</SectionTitle>
    <Paragraph position="0"> As stated in Section 1, syntagmatic relations hold among words arranged in a particular temporal order, hence they can be modeled by Sequence Kernels. The Syntagmatic Kernel is defined as a linear combination of Gap-Weighted Subsequences Kernels thatoperate atwordandPoStaglevel. Inparticular, following the approach proposed by Cancedda et al. (2003), it is possible to adapt sequence kernels to operate at word level by instancing the alphabet S with the vocabulary V = {w1,w2,...,wk}. Moreover, we restricted the generic definition of the Gap-Weighted Subsequences Kernel to recognize collocations in the local context of a specified word. The resulting kernel, called n-gram Collocation Kernel (KnColl), operates on sequences of lemmata around a specified word l0 (i.e.l[?]3,l[?]2,l[?]1,l0,l+1,l+2,l+3).</Paragraph>
    <Paragraph position="1"> Thisformulation allowsustoestimate thenumber of common (sparse) subsequences of lemmata (i.e. collocations) between twoexamples, inorder tocapture syntagmatic similarity.</Paragraph>
    <Paragraph position="2"> Analogously, we defined the PoS Kernel (KnPoS) to operate on sequences of PoS tags p[?]3, p[?]2, p[?]1, p0, p+1, p+2, p+3, where p0 is the PoS tag of l0.</Paragraph>
    <Paragraph position="3"> The Collocation Kernel and the PoS Kernel are defined by Equations 11 and 12, respectively.</Paragraph>
    <Paragraph position="5"> Both kernels depend on the parameter n, the length of the non-contiguous subsequences, and l, the de- null cay factor. For example, K2Coll allows us to represent all (sparse) bi-grams in the local context of a word.</Paragraph>
    <Paragraph position="6"> Finally, the Syntagmatic Kernel is defined as</Paragraph>
    <Paragraph position="8"> We will show that in WSD, the Syntagmatic Kernel is more effective than standard bigrams and tri-grams of lemmata and PoS tags typically used as features.</Paragraph>
  </Section>
  <Section position="5" start_page="59" end_page="60" type="metho">
    <SectionTitle>
4 Soft-Matching Criteria
</SectionTitle>
    <Paragraph position="0"> In the definition of the Syntagmatic Kernel only exact word matches contribute to the similarity. To overcome this problem, we further extended the definition of the Gap-Weigthed Subsequences Kernel given in Section 2 to allow soft-matching between words. In order to develop soft-matching criteria, we follow the idea that two words can be substituted preserving the meaning of the whole sentence if they are paradigmatically related (e.g. synomyns, hyponyms or domain related words). If the meaning of the proposition as a whole is preserved, the meaning of the lexical constituents of the sentence will necessarily remain unchanged too, providing a viable criterion todefineasoft-matching schema. This can be implemented by &amp;quot;plugging&amp;quot; external paradigmatic information into the Collocation kernel.</Paragraph>
    <Paragraph position="1"> Following the approach proposed by (Shawe-Taylor and Cristianini, 2004), the soft-matching Gap-Weighted Subsequences Kernel is now calculated recursively using Equations 3 to 5, 7 and 8, replacing Equation 6 by the equation:</Paragraph>
    <Paragraph position="3"> where axy are entries in a similarity matrix A between symbols (words). In order to ensure that the resulting kernel is valid, A must be positive semidefinite. null In the following subsections, we describe two alternative soft-matching criteria based on WordNet Synonymy and Domain Proximity. In both cases, to show that the similarity matrices are a positive semi-definite we use the following result: Proposition 1 A matrix A is positive semi-definite if and only if A = BTB for some real matrix B.</Paragraph>
    <Paragraph position="4"> The proof is given in (Shawe-Taylor and Cristianini, 2004).</Paragraph>
    <Section position="1" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
4.1 WordNet Synonymy
</SectionTitle>
      <Paragraph position="0"> The first solution we have experimented exploits a lexical resource representing paradigmatic relations among terms, i.e. WordNet. In particular, we used WordNet-1.7.1 for English and the Italian part of MultiWordNet2.</Paragraph>
      <Paragraph position="1"> In order to find a similarity matrix between terms, we defined a vector space where terms are represented by the WordNet synsets in which such terms appear. Hence, we can view a term as vector in which each dimension is associated with one synset.</Paragraph>
      <Paragraph position="2"> The term-by-synset matrix S is then the matrix whose rows are indexed by the synsets. The entry xij of S is 1 if the synset sj contains the term wi, and 0 otherwise. The term-by-synset matrix S gives rise to the similarity matrix A = SST between terms. Since A can be rewritten as A = (ST)TST = BTB, it follows directly by Proposition 1 that it is positive semi-definite.</Paragraph>
      <Paragraph position="3"> It is straightforward to extend the soft-matching criterion to include hyponym relation, but we achieved worse results. In the evaluation section we will not report such results.</Paragraph>
    </Section>
    <Section position="2" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
4.2 Domain Proximity
</SectionTitle>
      <Paragraph position="0"> The approach described above requires a large scale lexical resource. Unfortunately, formanylanguages, such a resource is not available. Another possibility for implementing soft-matching is introducing the notion of Semantic Domains.</Paragraph>
      <Paragraph position="1"> Semantic Domains are groups of strongly paradigmatically related words, and can be acquired automatically from corpora in a totally unsupervised way (Gliozzo, 2005). Our proposal is to exploit a Domain Proximity relation to define a soft-matching criterion on the basis of an unsupervised similarity metric defined in a Domain Space. The Domain Space can be determined once a Domain  Model (DM) is available. This solution is evidently cheaper, because large collections of unlabeled texts can be easily found for every language.</Paragraph>
      <Paragraph position="2"> A DM is represented by a k xkprime rectangular matrix D, containing the domain relevance for each term with respect to each domain, as illustrated in  ing a lexical coherence assumption (Gliozzo, 2005). Tothisaim, TermClustering algorithms canbeused: a different domain is defined for each cluster, and thedegree ofassociation between termsandclusters, estimated by the unsupervised learning algorithm, provides a domain relevance function. As a clustering technique we exploit Latent Semantic Analysis (LSA), following the methodology described in (Gliozzo et al., 2005b). This operation is done offline, and can be efficiently performed on large corpora. null LSA is performed by means of SVD of the term-by-document matrixTrepresenting the corpus. The SVDalgorithm can be exploited to acquire adomain matrix D from a large corpus in a totally unsupervised way. SVD decomposes the term-by-document matrix T into three matrices T = VSkUT where Sk is the diagonal k x k matrix containing the k singular values of T. D = VSkprime where kprime lessmuch k. Once a DM has been defined by the matrixD, the Domain Space is a kprime dimensional space, in which both texts and terms are represented by means of Domain Vectors (DVs), i.e. vectors representing the domain relevances among the linguistic object and each domain. The DV vectorwprimei for the term wi [?] V is the ith row of D, where V = {w1,w2,...,wk} is the vocabulary of the corpus.</Paragraph>
      <Paragraph position="3"> The term-by-domain matrix D gives rise to the term-by-term similarity matrix A = DDT among terms. It follows from Proposition 1 that A is positive semi-definite.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="60" end_page="61" type="metho">
    <SectionTitle>
5 Kernel Combination for WSD
</SectionTitle>
    <Paragraph position="0"> To improve the performance of a WSD system, it is possible to combine different kernels. Indeed, we followed this approach in the participation to Senseval-3 competition, reaching the state-of-the-art in many lexical-sample tasks (Strapparava et al., 2004). While this paper is focused on Syntagmatic Kernels, in this section we would like to spend some words on another important component for a complete WSD system: the Domain Kernel, used to model domain relations.</Paragraph>
    <Paragraph position="1"> Syntagmatic information alone is not sufficient to define a full kernel for WSD. In fact, in (Magnini et al., 2002), it has been claimed that knowing the domain of the text in which the word is located is a crucial information for WSD. For example the (domain) polysemy among the COMPUTER SCIENCE and the MEDICINE senses of the word virus can be solved by simply considering the domain of the context in which it is located.</Paragraph>
    <Paragraph position="2"> This fundamental aspect of lexical polysemy can be modeled by defining a kernel function to estimate the domain similarity among the contexts of the words to be disambiguated, namely the Domain Kernel. The Domain Kernel measures the similarity among the topics (domains) of two texts, so to capture domain aspects of sense distinction. It is a variation of the Latent Semantic Kernel (Shawe-Taylor and Cristianini, 2004), in which a DM is exploited to define an explicit mapping D : Rk - Rkprime from the Vector Space Model (Salton and McGill, 1983) into the Domain Space (see Section 4), defined by the following mapping:</Paragraph>
    <Paragraph position="4"> where IIDF is a k x k diagonal matrix such that iIDFi,i = IDF(wi), vectortj is represented as a row vector, and IDF(wi) isthe Inverse Document Frequency of wi. The Domain Kernel is then defined by:</Paragraph>
    <Paragraph position="6"> The final system for WSD results from a combination of kernels that deal with syntagmatic and paradigmatic aspects (i.e. PoS, collocations, bag of words, domains), according to the following kernel  combination schema:</Paragraph>
    <Paragraph position="8"/>
  </Section>
class="xml-element"></Paper>