<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1115"> <Title>Feature Weighting for Co-occurrence-based Classification of Words</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Feature Weighting Functions </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Mutual Information </SectionTitle> <Paragraph position="0"> Mutual information (MI) is an information-theoretic measure of association between two words, widely used in statistical NLP. Pointwise MI between class c and feature f measures how much information the presence of f carries about c:</Paragraph> <Paragraph position="2"/> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Gain Ratio </SectionTitle> <Paragraph position="0"> Gain Ratio (GR) is a normalized variant of Information Gain (IG), introduced into machine learning from information theory (Quinlan, 1993).</Paragraph> <Paragraph position="1"> IG measures the number of bits of information obtained about the presence and absence of a class by knowing the presence or absence of the feature1:</Paragraph> <Paragraph position="3"/> <Paragraph position="4"> Gain Ratio aims to overcome one disadvantage of IG: IG grows not only with the degree of dependence between f and c, but also with the entropy of f. As a result, features with low entropy receive smaller IG weights even though they may be strongly correlated with a class. GR removes this factor by normalizing IG by the entropy of the class:</Paragraph> <Paragraph position="5"/> 1 Strictly speaking, the definition does not define IG but the conditional entropy H(c|f); the other ingredient of the IG function, the entropy of c, is constant and is therefore omitted from the actual weight calculation. </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Odds Ratio </SectionTitle> <Paragraph position="0"> Odds Ratio (OR) is used in information retrieval to rank documents according to their relevance, on the basis of the association of their features with a set of positive documents. Mladenic (1998) reports OR to be a particularly successful method of selecting features for text categorization. The OR of a feature f, given the sets of positive and negative examples for class c, is defined as2:</Paragraph> 2 In cases when $p(f|c)$ equals 1 or $p(f|\bar{c})$ equals 0, we mapped the weight to the maximum OR weight in the class. </Section>
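All three functions above can be computed from the same simple co-occurrence counts. The following Python sketch illustrates one way to do so, using the conventional textbook forms of pointwise MI, Information Gain and Odds Ratio; the function name, the count-based parameterization and the exact formulations are illustrative assumptions and may differ in detail from the equations of Sections 3.1-3.3.

import math

def discriminative_weights(n_total, n_c, n_f, n_cf, eps=1e-12):
    """Conventional discriminative weights for a feature f and a class c.

    n_total -- number of training instances
    n_c     -- instances belonging to class c      (assumed 0 < n_c < n_total)
    n_f     -- instances containing feature f      (assumed 0 < n_f < n_total)
    n_cf    -- instances of class c that contain f
    Names and formulations are illustrative, not the paper's exact equations.
    """
    p_c, p_f = n_c / n_total, n_f / n_total
    p_cf = n_cf / n_total                                 # joint p(c, f)
    p_f_c = n_cf / n_c                                    # p(f | c)
    p_f_notc = (n_f - n_cf) / (n_total - n_c)             # p(f | not-c)

    # Pointwise mutual information: log p(c,f) / (p(c) p(f))
    mi = math.log((p_cf + eps) / (p_c * p_f + eps))

    # Information Gain as H(c) - H(c|f); footnote 1 notes that the constant
    # H(c) term can be dropped without changing the feature ranking.
    def H(ps):
        return -sum(p * math.log(p, 2) for p in ps if p > 0)
    h_c = H([p_c, 1 - p_c])
    h_c_f = p_f * H([n_cf / n_f, 1 - n_cf / n_f]) + \
            (1 - p_f) * H([(n_c - n_cf) / (n_total - n_f),
                           1 - (n_c - n_cf) / (n_total - n_f)])
    ig = h_c - h_c_f

    # Gain Ratio: IG normalized by the entropy of the class, as in the text.
    gr = ig / (h_c + eps)

    # Odds Ratio: p(f|c)(1 - p(f|not-c)) / ((1 - p(f|c)) p(f|not-c)).
    # The paper maps zero-denominator cases to the maximum OR weight in the
    # class (footnote 2); this sketch simply returns infinity there.
    denom = (1 - p_f_c) * p_f_notc
    odds = (p_f_c * (1 - p_f_notc)) / denom if denom > 0 else float("inf")

    return {"MI": mi, "IG": ig, "GR": gr, "OR": odds}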
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Term Strength </SectionTitle> <Paragraph position="0"> Term Strength (TS) was introduced by Wilbur and Sirotkin (1992) for improving the efficiency of document retrieval by feature selection. It was later studied in a number of works by Yang and her colleagues (e.g., Yang and Pedersen, 1997), who found that it performs on a par with the best discriminative functions on the document categorization task. The method is based on the idea that the most valuable features are shared by related documents. It defines the weight of a feature as the probability of finding it in some document d given that it also appears in a document d' similar to d. To calculate TS for a feature f, for each word n we first retrieved several related words n' using a distributional similarity measure, thus preparing a set of pairs (n, n'). The TS weight for f was then calculated as the conditional probability of f appearing in n given that f also appears in n' (the ordering of words inside a pair is ignored): $TS(f) = P(f \in n \mid f \in n')$ (5) An important parameter in TS is the threshold on the similarity measure used to judge two words to be sufficiently related. Yang and Pedersen determined this threshold by first deciding how many documents can be related to a given one and then finding the average minimum similarity for this number of neighbors over all documents in the collection. It should be noted that TS makes no use of information about feature-class associations; it is therefore unsupervised and can be used only for global feature weighting.</Paragraph> <Paragraph position="2"> We introduce two supervised variants of TS, which can be applied locally: TSL1 and TSL2. The first differs from TS in that, firstly, related words for n are looked for not in the entire training set but within the class of n; secondly, the weight for a feature is estimated from the distribution of the feature across pairs of members of only that class: $TSL_1(f, c) = P(f \in n \mid f \in n')$, with $n, n' \in c$ (6) Thus, by weighting features using TSL1 we aim to increase the similarity between members of a class and to disregard possible similarities across classes. Both TS and TSL1 require the computation of similarities between a large set of words and thus incur significant computational costs. We therefore tried another, much more efficient method of identifying features characteristic of a class, called TSL2. Like TSL1, it looks at how many members of a class share a feature. But instead of computing a set of nearest neighbors for each member, it simply uses all the words in the class as the set of related words. TSL2 is the proportion of instances that possess feature f to the total number of instances in c: $TSL_2(f, c) = |\{n \in c : f \in n\}| \, / \, |c|$ (7)</Paragraph> <Paragraph position="4"> Table 1 illustrates the 10 highest-scored features according to the five supervised functions for the class {ambulance, car, bike, coupe, jeep, motorbike, taxi, truck} (estimated from the BNC co-occurrence data described in Section 4).</Paragraph> <Paragraph position="5"> Table 1: The 10 highest-scored features for the class {ambulance, car, bike, coupe, jeep, motorbike, taxi, truck} according to MI, GR, OR, TSL1, and TSL2.</Paragraph> <Paragraph position="6"> The examples vividly demonstrate the basic difference between the functions emphasizing discriminative features and those emphasizing characteristic features. The former attribute the greatest weights to very rare context words, some of which seem rather informative (knock_by, climb_of, see_into), while others appear to be occasional collocates (remand_to, recover_in) or parsing mistakes (entrust_to, force_of). In contrast, the latter encourage frequent context words. Among them are words that are intuitively useful (drive, park, get_into), but also words that are too abstract (see, get, take). Inspection of the weights suggests that both feature scoring strategies are able to identify different potentially useful features, but at the same time often attribute great relevance to quite non-informative features.
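Both supervised TS variants reduce to simple counting over the members of a class. The sketch below shows one possible Python implementation of TSL1 and TSL2; the helper names (class_members, has_feature, nearest_neighbors) are illustrative assumptions rather than the actual experimental code.

def tsl2(class_members, has_feature, f):
    """TSL2 (eq. 7): proportion of class members that contain feature f."""
    return sum(1 for n in class_members if has_feature(n, f)) / len(class_members)

def tsl1(class_members, has_feature, f, nearest_neighbors):
    """TSL1 (eq. 6): P(f in n | f in n') estimated over pairs (n, n') of
    distributionally similar words drawn from the same class.
    nearest_neighbors(n) is assumed to return the related words of n
    restricted to n's own class."""
    pairs = [(n, n2) for n in class_members for n2 in nearest_neighbors(n)]
    # The ordering of words inside a pair is ignored, so each pair is
    # counted in both directions.
    trials = [(n, n2) for n, n2 in pairs if has_feature(n2, f)] + \
             [(n2, n) for n, n2 in pairs if has_feature(n, f)]
    hits = sum(1 for n, _ in trials if has_feature(n, f))
    return hits / len(trials) if trials else 0.0

Estimated this way, the cost of TSL1 grows with the size of the neighbor lists, whereas TSL2 needs only a single pass over the class, which reflects the efficiency argument made above.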
We next describe an empirical evaluation of these functions.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Settings </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Data </SectionTitle> <Paragraph position="0"> The evaluation was carried out on the task of classifying English nouns into predefined semantic classes. The meaning of each noun n ∈ N was represented by a vector whose features are the verbs v ∈ V with which the noun is used as either a direct or a prepositional object. The values of the features were the conditional probabilities p(v|n). Two different datasets were used in the experiments: verb-noun co-occurrence pairs extracted from the British National Corpus (BNC)3 and from the Associated Press 1988 corpus (AP)4. Rare nouns were filtered out: the BNC data contained only nouns that appeared with at least 19 different verbs. Co-occurrences that appeared only once were removed.</Paragraph> <Paragraph position="1"> To provide the extracted nouns with the class labels needed for training and evaluation, the nouns were arranged into classes using WordNet in the following manner. Each class was made up of those nouns whose most frequent senses are hyponyms of a node seven edges below the root level of WordNet. Only classes with 5 or more members were used in the study. Thus, from the BNC data we formed 60 classes with 514 nouns, and from the AP data 42 classes with 375 nouns.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Classification algorithms </SectionTitle> <Paragraph position="0"> Two classification algorithms were used in the study: k nearest neighbors (kNN) and Naive Bayes, both of which have previously been shown to be quite robust on high-dimensional representations in tasks including word classification (e.g., Ciaramita 2002).</Paragraph> <Paragraph position="1"> The kNN algorithm classifies a test instance by first identifying its k nearest neighbors among the training instances according to some similarity measure and then assigning it to the class that holds the majority in the set of nearest neighbors. We used the weighted kNN algorithm: the vote of each neighbor was weighted by the score of its similarity to the test instance.</Paragraph> <Paragraph position="2"> As is well known, kNN's performance is highly sensitive to the choice of the similarity metric.</Paragraph> <Paragraph position="3"> We therefore experimented with several similarity metrics and found that on both datasets Jensen-Shannon Divergence yields the best classification results (see Table 1). Incidentally, this is in accordance with the study by Dagan et al. (1997), who found that it consistently performed better than a number of other popular functions.</Paragraph>
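The weighted kNN procedure described above can be sketched in a few lines of Python, using the Jensen-Shannon divergence defined below as the (dis)similarity measure. The data layout, the 1 - JSD vote weighting and all names are illustrative assumptions; in particular, the paper does not specify how similarity scores were turned into vote weights.

import math

def jsd(p, q):
    """Jensen-Shannon divergence between two feature distributions (eqs. 8-9).
    p and q map features (verbs) to probabilities such as p(v|n)."""
    avg = {v: 0.5 * (p.get(v, 0.0) + q.get(v, 0.0)) for v in set(p) | set(q)}
    def kl(x):                        # D(x || avg); avg > 0 wherever x > 0
        return sum(x[v] * math.log(x[v] / avg[v]) for v in x if x[v] > 0)
    return 0.5 * (kl(p) + kl(q))

def knn_classify(test_vec, train_data, k):
    """Distance-weighted kNN. train_data is a list of (class_label, feature_dict)
    pairs; each of the k nearest neighbors votes with a weight derived from its
    similarity to the test instance (1 - JSD is one simple choice, since JSD
    with natural logarithms is bounded by ln 2)."""
    scored = sorted((jsd(test_vec, vec), label) for label, vec in train_data)
    votes = {}
    for dist, label in scored[:k]:
        votes[label] = votes.get(label, 0.0) + (1.0 - dist)
    return max(votes, key=votes.get)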
<Paragraph position="5"> Jensen-Shannon Divergence measures the (dis)similarity between a training instance n and a test instance m as: $J(n, m) = \frac{1}{2}\,[\,D(n \,\|\, avg_{n,m}) + D(m \,\|\, avg_{n,m})\,]$ (8) where D is the Kullback-Leibler divergence between two probability distributions x and y: $D(x \,\|\, y) = \sum_{v \in V} p(v|x) \log \frac{p(v|x)}{p(v|y)}$ (9) and $avg_{n,m}$ is the average of the distributions of n and m.</Paragraph> <Paragraph position="6"> In testing each weighting method, we experimented with k = 1, 3, 5, 7, 10, 15, 20, 30, 50, 70, and 100 in order to take into account the fact that feature weighting typically changes the optimal value of k. The results for kNN reported below are the highest effectiveness measures obtained over all k in a particular test.</Paragraph> <Paragraph position="7"> The Naive Bayes algorithm classifies a test instance m by finding the class c that maximizes $p(c \mid V_m)$, where $V_m$ is the set of features observed in m. Assuming independence between the features, the goal of the algorithm can be stated as:</Paragraph> <Paragraph position="9"> $c^{*} = \arg\max_{c_i} \; p(c_i) \prod_{v \in V_m} p(v \mid c_i)$ where $p(c_i)$ and $p(v|c_i)$ are estimated during the training process from the corpus data.</Paragraph> <Paragraph position="10"> The Naive Bayes classifier adopted in the study was the binary independence model, which estimates $p(v|c_i)$ assuming a binomial distribution of features across classes. In order to introduce the information inherent in the frequencies of features into the model, all input probabilities were calculated from the real values of the features, as suggested by Lewis (1998).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Evaluation method </SectionTitle> <Paragraph position="0"> To evaluate the quality of the classifications, we adopted the ten-fold cross-validation technique. The same 10 test-train splits were used in all experiments.</Paragraph> <Paragraph position="1"> Since we found that the difficulty of particular test sets can vary considerably, using the same test-train splits allowed us to estimate the statistical significance of differences between the results of particular methods (a one-tailed paired t-test was used for this purpose). Effectiveness was first measured in terms of precision and recall, which were then used to compute the $F_\beta$ score5. The reported evaluation measure is the microaveraged F score.</Paragraph> <Paragraph position="2"> As a baseline, we used the kNN and Naive Bayes classifiers trained and tested on non-weighted instances.</Paragraph> </Section> </Section> </Paper>