Effective use of WordNet semantics via kernel-based learning

2 Term similarity based on general knowledge

In IR, any similarity metric in the vector space model is driven by lexical matching. When little training material is available, few words can be used effectively and the resulting document similarity metrics may be inaccurate. Semantic generalizations overcome data sparseness problems, as contributions from different but semantically similar words become available.

Methods for the induction of semantically inspired word clusters have been widely used in language modeling and lexical acquisition tasks (e.g. (Clark and Weir, 2002)). The resource employed in most works is WordNet (Fellbaum, 1998), which contains three subhierarchies: for nouns, verbs and adjectives. Each hierarchy represents lexicalized concepts (or senses) organized according to an "is-a-kind-of" relation. A concept $s$ is described by a set of words $syn(s)$ called a synset. The words $w \in syn(s)$ are synonyms according to the sense $s$.

For example, the words line, argumentation, logical argument and line of reasoning describe a synset which expresses the methodical process of logical reasoning (e.g. "I can't follow your line of reasoning"). Each word/term may be lexically related to more than one synset, depending on its senses. The word line is also a member of the synset line, dividing line, demarcation and contrast, as a line also denotes a conceptual separation (e.g. "there is a narrow line between sanity and insanity"). The WordNet noun hierarchy is a directed acyclic graph in which the edges establish the direct isa relations between two synsets; as only 1% of its nodes have more than one parent in the graph, most techniques assume the hierarchy to be a tree and treat the few exceptions heuristically.
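This synset and hypernym structure can be inspected programmatically. The sketch below uses NLTK's WordNet interface (an assumption of this illustration; the paper does not prescribe a toolkit) to reproduce the line example and to walk the noun hierarchy:

```python
# Exploring WordNet synsets and the noun is-a hierarchy with NLTK.
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

# A word may belong to several synsets, one per sense; a synset's
# lemmas are the synonyms describing the underlying concept.
for s in wn.synsets('line', pos=wn.NOUN):
    print(s.name(), '->', [l.name() for l in s.lemmas()])
# Among the listed senses are the "logical argument" and the
# "conceptual separation" synsets discussed in the text.

# Direct is-a links are exposed as hypernym edges; following the
# first parent at each step mirrors the tree approximation of the
# (almost tree-shaped) noun DAG.
sense = wn.synsets('line', pos=wn.NOUN)[0]
while sense.hypernyms():
    sense = sense.hypernyms()[0]
    print(sense.name())  # climbs to the root synset entity.n.01
```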
2.1 The Conceptual Density

The automatic use of WordNet for NLP and IR tasks has proved to be very complex. First, it is unclear how the topological distance between senses relates to their conceptual distance; the pervasive lexical ambiguity is also problematic, as it affects the measurement of conceptual distance between word pairs. Second, approximating a set of concepts by means of their generalization in the hierarchy implies a conceptual loss that affects the target IR (or NLP) task. For example, black and white are colors but also chess pieces, and this impacts the similarity score that should be used in IR applications. Methods addressing these problems map the terms a priori to specific generalization levels, i.e. to cuts in the hierarchy (e.g. (Li and Abe, 1998; Resnik, 1997)), and use corpus statistics to weight the resulting mappings. For several tasks (e.g. in TC) this is unsatisfactory: different contexts of the same corpus (e.g. documents) may require different generalizations of the same word, as they independently impact the document similarity.

In contrast, Conceptual Density (CD) (Agirre and Rigau, 1996) is a flexible semantic similarity that depends on the generalizations of word senses without referring to any fixed level of the hierarchy. CD defines a metric according to the topological structure of WordNet and can be seamlessly applied to two or more words. The measure formalized hereafter adapts to word pairs a more general definition given in (Basili et al., 2004).

We denote by $\bar{s}$ the set of nodes of the hierarchy rooted in the synset $s$, i.e. $\{c \in S \mid c \ \mathrm{isa}\ s\}$, where $S$ is the set of WN synsets. By definition, $\forall s \in S,\ s \in \bar{s}$. CD makes a guess about the proximity of the senses $s_1$ and $s_2$ of two words $u_1$ and $u_2$, according to the information expressed by the minimal subhierarchy $\bar{s}$ that includes them. Let $S_i$ be the set of generalizations of at least one sense $s_i$ of the word $u_i$. The CD of $u_1$ and $u_2$ is:

$$CD(u_1,u_2) = \begin{cases} 0 & \text{iff } S_1 \cap S_2 = \emptyset \\[4pt] \max\limits_{s \in S_1 \cap S_2} \dfrac{\sum_{i=0}^{h} \mu(\bar{s})^i}{|\bar{s}|} & \text{otherwise} \end{cases} \qquad (1)$$

where:

* $S_1 \cap S_2$ is the set of common generalizations (i.e. the common hypernyms) of $u_1$ and $u_2$.

* $\mu(\bar{s})$ is the average number of children per node (i.e. the branching factor) in the sub-hierarchy $\bar{s}$. $\mu(\bar{s})$ depends on WordNet, and in some cases its value can approach 1.

* $h$ is the depth of the ideal, i.e. maximally dense, tree with enough leaves to cover the two senses $s_1$ and $s_2$, according to an average branching factor of $\mu(\bar{s})$. This value is estimated by:

$$h = \begin{cases} \lfloor \log_{\mu(\bar{s})} 2 \rfloor & \text{if } \mu(\bar{s}) \neq 1 \\ 2 & \text{otherwise} \end{cases} \qquad (2)$$

When $\mu(\bar{s})=1$, $h$ ensures a tree with at least 2 nodes to cover $s_1$ and $s_2$ (height = 2).

* $|\bar{s}|$ is the number of nodes in the sub-hierarchy $\bar{s}$. This value is statically measured on WN and acts as a negative bias against higher-level generalizations (i.e. larger $\bar{s}$).

CD models the semantic distance as the density of the generalizations $s \in S_1 \cap S_2$. This density is the ratio between the number of nodes of the ideal tree and $|\bar{s}|$. The ideal tree should (a) link the two senses/nodes $s_1$ and $s_2$ with the minimal number of edges (isa relations) and (b) maintain the same branching factor (bf) observed in $\bar{s}$. In other words, this tree provides the minimal number of nodes (and isa relations) sufficient to connect $s_1$ and $s_2$ according to the topological structure of $\bar{s}$. For example, if $\bar{s}$ has a bf of 2, the ideal tree connects the two senses with a single node (their father). If the bf is 1.5, to replicate it the ideal tree must contain 4 nodes: the grandfather, which has a bf of 1, and the father, which has a bf of 2, for an average of 1.5. When the bf is 1, Eq. 1 degenerates to the inverse of the number of nodes in the path between $s_1$ and $s_2$, i.e. the simple proximity measure used in (Siolas and d'Alché-Buc, 2000).

It is worth noting that for each pair, $CD(u_1,u_2)$ determines the similarity according to the closest lexical senses $s_1, s_2 \in \bar{s}$: the remaining senses of $u_1$ and $u_2$ are irrelevant, with a semantic disambiguation side effect. CD has been successfully applied to semantic tagging (Basili et al., 2004).

As the WN hierarchies for the other POS classes (i.e. verbs and adjectives) have topological properties different from the noun hyponymy network, their semantics is not suitably captured by Eq. 1. In this paper, Eq. 1 has therefore been applied only to noun pairs. As the high number of such pairs increases the computational complexity of the target learning algorithm, efficient approaches are needed.
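To make Eqs. 1 and 2 concrete, here is a naive sketch of CD built on the NLTK interface used above. The helper names are ours; the branching factor is averaged over internal nodes only (an assumption consistent with the bf examples in the text), and nothing is cached, which illustrates why the quadratic number of noun pairs is costly in practice:

```python
import math
from itertools import product
from nltk.corpus import wordnet as wn

def sub_hierarchy(s):
    """s-bar: every synset c such that c isa s, including s itself."""
    nodes, frontier = {s}, [s]
    while frontier:
        for child in frontier.pop().hyponyms():
            if child not in nodes:
                nodes.add(child)
                frontier.append(child)
    return nodes

def branching_factor(nodes):
    """Average number of children per internal node of the sub-hierarchy
    (leaves excluded, so a chain has bf 1 as in the text's examples)."""
    internal = [n for n in nodes if n.hyponyms()]
    if not internal:
        return 1.0
    return sum(len(n.hyponyms()) for n in internal) / len(internal)

def cd(u1, u2):
    """Eq. 1: density of the densest common generalization of u1 and u2."""
    best = 0.0
    for s1, s2 in product(wn.synsets(u1, pos=wn.NOUN),
                          wn.synsets(u2, pos=wn.NOUN)):
        for s in s1.common_hypernyms(s2):   # S1 intersection S2
            nodes = sub_hierarchy(s)        # huge near the root: cache in practice
            mu = branching_factor(nodes)
            h = 2 if mu == 1 else math.floor(math.log(2, mu))   # Eq. 2
            ideal = sum(mu ** i for i in range(h + 1))  # nodes of the ideal tree
            best = max(best, ideal / len(nodes))
    return best
```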
The next section describes how kernel methods can make the use of Conceptual Density in Text Categorization practical.

3 A WordNet kernel for document similarity

Term similarities are used to design document similarities, which are the core functions of most TC algorithms. The term similarity proposed in Eq. 1 is valid for all term pairs of a target vocabulary and has two main advantages: (1) the relatedness of each term occurring in the first document can be computed against all terms in the second document, i.e. all pairs of similar (not just identical) tokens can contribute; and (2) if we use all term pair contributions in the document similarity, we obtain a measure consistent with the term probability distributions, i.e. the sum of all term contributions does not arbitrarily penalize or emphasize any subset of terms. The next section presents this idea more formally.

3.1 A semantic vector space

Given two documents $d_1, d_2 \in D$ (the document set), we define their similarity as:

$$K(d_1,d_2) = \sum_{w_1 \in d_1,\ w_2 \in d_2} (\lambda_1 \lambda_2) \times \sigma(w_1,w_2) \qquad (3)$$

where $\lambda_1$ and $\lambda_2$ are the weights of the words (features) $w_1$ and $w_2$ in the documents $d_1$ and $d_2$, respectively, and $\sigma$ is a term similarity function, e.g. the Conceptual Density defined in Section 2. To prove that Eq. 3 is a valid kernel, it is enough to show that it is a specialization of the general definition of convolution kernels formalized in (Haussler, 1999), which we report hereafter. Let $X, X_1, \ldots, X_m$ be separable metric spaces, $x \in X$ a structure and $\vec{x} = x_1, \ldots, x_m$ its parts, where $x_i \in X_i$ for all $i = 1, \ldots, m$. Let $R$ be a relation on the set $X \times X_1 \times \ldots \times X_m$ such that $R(\vec{x},x)$ is "true" if $\vec{x}$ are the parts of $x$. We indicate with $R^{-1}(x)$ the set $\{\vec{x} : R(\vec{x},x)\}$. Given two objects $x, y \in X$, their similarity $K(x,y)$ is defined as:

$$K(x,y) = \sum_{\vec{x} \in R^{-1}(x)} \ \sum_{\vec{y} \in R^{-1}(y)} \ \prod_{i=1}^{m} K_i(x_i,y_i) \qquad (4)$$

If $X$ defines the document set (i.e. $D = X$) and $X_1$ the vocabulary of the target document corpus ($X_1 = V$), then $m = 1$ and Eq. 4 reduces to $K(d_1,d_2) = \sum_{w_1 \in d_1} \sum_{w_2 \in d_2} K_1(w_1,w_2)$ with $K_1(w_1,w_2) = (\lambda_1 \lambda_2) \times \sigma(w_1,w_2)$, i.e. Eq. 3.

The above equation can be used in Support Vector Machines, as illustrated in the next section.

3.2 Support Vector Machines and kernel methods

Given the vector space $\mathbb{R}^e$ and a set of positive and negative points, SVMs classify vectors according to a separating hyperplane, $H(\vec{x}) = \vec{\omega} \cdot \vec{x} + b = 0$, where $\vec{x}, \vec{\omega} \in \mathbb{R}^e$ and $b \in \mathbb{R}$ are learned by applying the Structural Risk Minimization principle (Vapnik, 1995). From kernel theory we have that:

$$H(\vec{x}) = \sum_{h=1}^{l} y_h \alpha_h \vec{x}_h \cdot \vec{x} + b = \sum_{h=1}^{l} y_h \alpha_h \phi(d_h) \cdot \phi(d) + b = \sum_{h=1}^{l} y_h \alpha_h K(d_h,d) + b \qquad (5)$$

where $d$ is a classifying document and the $d_h$ are all the $l$ training instances (with labels $y_h$ and learned weights $\alpha_h$), projected into $\vec{x}$ and $\vec{x}_h$ respectively. The product $K(d,d_h) = \langle \phi(d) \cdot \phi(d_h) \rangle$ is the Semantic WN-based Kernel (SK) function associated with the mapping $\phi$.

Eq. 5 shows that to evaluate the separating hyperplane in $\mathbb{R}^e$ we do not need to evaluate the entire vectors $\vec{x}_h$ or $\vec{x}$. Indeed, we do not even know the mapping $\phi$ or the number of dimensions $e$.
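Since only $K(d,d_h)$ is ever needed, the kernel can be computed directly on sparse bag-of-words representations. A minimal sketch of Eq. 3 follows (our own helper names; $\sigma$ could be the cd() sketch of Section 2.1, with string matching handling identical-token pairs):

```python
def semantic_kernel(d1, d2, sigma):
    """Eq. 3: sum over all word pairs of the two documents of the
    product of the term weights and the term similarity sigma.
    Documents are sparse bags, i.e. dicts mapping word -> weight."""
    return sum(l1 * l2 * sigma(w1, w2)
               for w1, l1 in d1.items()
               for w2, l2 in d2.items())

# Toy usage: identical tokens match with similarity 1, the
# remaining noun pairs contribute through Conceptual Density.
d1 = {'line': 0.7, 'reasoning': 0.3}
d2 = {'argument': 0.5, 'line': 0.5}
k = semantic_kernel(d1, d2,
                    lambda w1, w2: 1.0 if w1 == w2 else cd(w1, w2))
```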
As it is sufficient to compute $K(d,d_h)$, we can carry out the learning with Eq. 3 in $\mathbb{R}^n$, avoiding the explicit representation in the $\mathbb{R}^e$ space. The real advantage is that we can consider only the word pairs associated with non-zero weights, i.e. we can use a sparse vector computation. Additionally, to obtain a uniform score across different document sizes, the kernel function can be normalized as follows:

$$SK(d_1,d_2) = \frac{K(d_1,d_2)}{\sqrt{K(d_1,d_1) \times K(d_2,d_2)}} \qquad (6)$$

4 Experiments

The use of WordNet (WN) in the term similarity function introduces prior knowledge whose impact on the Semantic Kernel (SK) should be experimentally assessed. The main goal is to compare the traditional Vector Space Model kernel against SK, both within the Support Vector learning algorithm.

The high complexity of SK limits the size of the experiments that we can carry out in a feasible time. Moreover, we are not interested in large collections of training documents, since under such training conditions simple bag-of-words models are generally very effective, i.e. they seem to model well the document similarity needed by the learning algorithms. Thus, we carried out the experiments on small subsets of the 20NewsGroups (20NG) and Reuters-21578 corpora to simulate critical learning conditions.

4.1 Experimental set-up

For the experiments we used the SVM-light software (Joachims, 1999) (available at svmlight.joachims.org) with the default linear kernel on the token space, adopted as the baseline evaluation. For the SK evaluation we implemented Eq. 3 with $\sigma(\cdot,\cdot) = CD(\cdot,\cdot)$ (Eq. 1) inside SVM-light. As Eq. 1 is only defined for nouns, a part-of-speech (POS) tagger was applied beforehand. However, verbs, adjectives and numerical features were also included in the pair space; for these tokens, a CD of 0 is assigned to pairs made of different strings. As the POS tagger could introduce errors, in a second experiment any token with a successful look-up in the WN noun hierarchy was considered in the kernel. This approximation has the benefit of retrieving useful information even for verbs and of capturing the similarity between verbs and some nouns, e.g. to drive (via the noun drive) has a common synset with parkway.

For the evaluations we applied a careful SVM parameterization: a preliminary investigation suggested that the trade-off parameter between training-set error and margin (the c option in SVM-light) optimizes the F1 measure for values in the range [0.02, 0.32]. We also noted that the cost-factor parameter (the j option) is not critical, i.e. a value of 10 always optimizes the accuracy. Feature selection techniques and weighting schemes were not applied in our experiments, as they cannot be accurately estimated from the small available training data.

The classification performance was evaluated by means of the F1 measure for the single categories and the MicroAverage for the final classifier pool (Yang, 1999).
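For reference, a short sketch of these measures under their usual definitions (an assumption; (Yang, 1999) gives the formal treatment): per-category F1 from the category's contingency table, and MicroAverage F1 from the globally pooled counts:

```python
def f1(tp, fp, fn):
    """F1 of one category: harmonic mean of precision and recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_f1(tables):
    """MicroAverage F1 over the classifier pool: pool the per-category
    contingency tables (tp, fp, fn), then compute a single F1."""
    tp, fp, fn = (sum(t[i] for t in tables) for i in range(3))
    return f1(tp, fp, fn)
```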
Given the high computational complexity of SK, we selected 8 categories from 20NG and 8 from the Reuters corpus. To derive statistically significant results with few training documents, for each corpus we randomly selected 10 different samples from the 8 categories. We trained the classifiers on one sample, parameterized them on a second sample and derived the measures on the other 8. By rotating the training sample we obtained 80 different measures for each model. The size of the samples ranges from 24 to 160 documents, depending on the target experiment.

4.2 Cross-validation results

SK (Eq. 3) was compared with the linear kernel, which obtained the best F1 measure in (Joachims, 1999). Table 1 reports the first comparative results for 8 categories of 20NG with 40 training documents. The results are expressed as the mean and standard deviation over 80 runs. The F1 values are reported in Column 2 for the linear kernel, i.e. bow, in Column 3 for SK without POS information and in Column 4 for SK with POS information (SK-POS). The last row shows the MicroAverage performance of the three models on all 8 categories. We note that SK improves on bow by 3%, i.e. 34.3% vs. 31.5%, and that the POS information reduces the improvement of SK, i.e. 33.5% vs. 34.3%.

To verify the hypothesis that WN information is useful in low-training-data conditions, we repeated the evaluation over the 8 categories of Reuters with samples of 24 and 160 documents, respectively. The results reported in Table 2 show that (1) SK again improves on bow (41.7% - 37.2% = 4.5%) and (2) as the number of documents increases, the improvement decreases (77.9% - 75.9% = 2%). It is worth noting that the standard deviations tend to assume high values. In general, the use of 10 disjoint training/testing samples produces higher variability than n-fold cross-validation, which insists on the same document set. However, this does not affect the Student's t confidence test over the differences between the MicroAverages of SK and bow: the former is more accurate at the 99% confidence level.

The above findings confirm that SK outperforms the bag-of-words kernel in critical learning conditions, as the semantic contribution of SK recovers useful information. To complete this study we carried out experiments with samples of different sizes, i.e. 3, 5, 10, 15 and 20 documents for each category. Figures 1 and 2 show the learning curves for the 20NG and Reuters corpora. Each point refers to the average over 80 samples.

As expected, the improvement provided by SK decreases when more training data is available. However, the improvements are still not negligible: the SK model (without POS information) preserves about 2-3% of improvement with 160 training documents. The matching allowed between noun-verb pairs still captures semantic information useful for topic detection. In particular, during the similarity estimation each word activates 60.05 pairs on average, which is particularly useful to increase the amount of information available to the SVMs.
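As a compact summary of the sampling scheme used throughout these experiments, the following sketch (hypothetical helper names) implements the rotation described at the beginning of this section: train on one sample, parameterize on a second, measure on the remaining eight, and rotate:

```python
def rotate_evaluation(samples, train, tune, evaluate):
    """samples: the 10 disjoint random samples drawn from a corpus.
    Returns 10 * 8 = 80 F1 measures for one model."""
    scores, n = [], len(samples)
    for i in range(n):
        j = (i + 1) % n                              # parameterization sample
        model = tune(train(samples[i]), samples[j])  # e.g. pick c in [0.02, 0.32]
        scores.extend(evaluate(model, samples[k])
                      for k in range(n) if k not in (i, j))
    return scores
```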
Finally, we carried out some experiments with 160 Reuters documents in which string matching was discarded from SK: only words with different surface forms were allowed to contribute to Eq. 3.

[Figure 1: F1 learning curves of the bow, SK and SK-POS kernels over the 8 categories of 20NewsGroups.]

The important outcome is that SK converges to a MicroAverage F1 measure of 56.4% (compare with Table 2). This shows that the word similarity provided by WN is still consistent and, even in this worst case, somewhat effective for TC: the evidence is that a suitable balance between lexical ambiguity and topical relatedness is captured by the SVM learning.