<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1115">
  <Title>Feature Weighting for Co-occurrence-based Classification of Words</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Lexical repositories like thesauri and lexicons are today a key component of many NLP technologies, where they serve as background knowledge for processing the semantics of text. But, as is well known, manual compilation of such resources is a very costly procedure, and their automated construction is an important research issue.</Paragraph>
    <Paragraph position="1"> One promising possibility to speed up the lexical acquisition process is to glean the semantics of words from a corpus by adopting the co-occurrence model of word meaning. Previous research has investigated a wide range of its applications, including automatic construction of thesauri, their enrichment, acquisition of bilingual lexicons, learning of information extraction patterns, named entity classification and others..</Paragraph>
    <Paragraph position="2"> The basic idea behind the approach is that the distribution of a word across lexical contexts (other words and phrases it co-occurs with) is highly indicative of its meaning. The method represents the meaning of a word as a vector where each feature corresponds to a context and its value to the frequency of the word's occurring in that context.</Paragraph>
    <Paragraph position="3"> Once such representation is built, machine learning techniques can be used to perform various lexical acquisition tasks, e.g. automatically classify or cluster words according to their meaning.</Paragraph>
    <Paragraph position="4"> However, using natural language words as features inevitably results in very noisy .* The study was partially supported by the Russian Foundation Basic Research grant #03-06-80008.</Paragraph>
    <Paragraph position="5"> representations. Because of their inherent polysemy and synonymy, many context words become ambiguous or redundant features. It is therefore desirable to determine a measure of usefulness of each feature and weight it accordingly. Still, despite a wide variety of feature weighting methods existing in machine learning, these methods are poorly explored in application to lexical acquisition. There have been a few studies (e.g., Lin, 1998; Ciaramita, 2002; Alfonseca and Manandhar, 2002) where word representations are modified through this or that kind of feature weighting. But in these studies it is performed only as a standard pre-processing step on the analogy with similar tasks like text categorization, and the choice of a particular weighting procedure is seldom motivated. To our knowledge, there is no work yet on evaluation and comparison of different weighting methods for lexical acquisition.</Paragraph>
    <Paragraph position="6"> The goal of this paper is to comparatively study a number of popular feature weighting methods in application to the task of word classification.</Paragraph>
    <Paragraph position="7"> The structure of the paper is the following.</Paragraph>
    <Paragraph position="8"> Section 2 more formally describes the task of feature weighting. Section 3 describes the weighting methods under study. Section 4 details the experimental data, classification algorithms used, and evaluation methods. Section 5 is concerned with the results of the experiments and their discussion. Section 6 presents conclusions from the study.</Paragraph>
    <Paragraph position="9"> 2 Two feature weighting strategies In machine learning, feature weighting before classification is performed with the purpose to reflect how much particular features reveal about class membership of instances. The weights of features are determined from their distribution across training classes, which is why the weighting procedure can be called supervised. In the context of word classification this procedure can be formalized as follows.</Paragraph>
    <Paragraph position="10"> Let us assume that each word n[?]N of the training set is represented as a feature vector, consisting of features f [?] F, and that each n is assigned a class label c[?]C, i.e. [?]n[?]c[?]C: n,c. For each f, from its distribution across C, a certain function computes its relevance score, specific to each class. This score can be used directly as its local weight w(f,c). Alternatively, from class-specific weights of a feature, one can compute its single global weight, using some globalization policy. For example, as a global weight one can use the maximum local weight of f across all classes wglob(f)= ),(max cfwCc[?] . After the weights have been applied to the training data, a classifier is learned and evaluated on the test data.</Paragraph>
    <Paragraph position="11"> A key decision in the weighting procedure is to choose a function computing w(f,c). Such functions typically try to capture the intuition that the best features for a class are the ones that best discriminate the sets of its positive and negative examples. They determine w(f,c) from the distribution of f between c and c , attributing greater weights to those f that correlate with c or c most. In the present study we include three such functions widely used in text categorization: mutual information, information gain ratio and odds ratio.</Paragraph>
    <Paragraph position="12"> There is another view on feature scoring that it is sometimes adopted in classification tasks.</Paragraph>
    <Paragraph position="13"> According to this view, useful are those features that are shared by the largest number of positive examples of c. The purpose of emphasizing these features is to characterize the class without necessarily discriminating it from other classes.</Paragraph>
    <Paragraph position="14"> Functions embodying this view assess w(f,c) from the distribution of f across n , c, giving greater weight to those f that are distributed most uniformly. Although they do not explicitly aim at underpinning differences between classes, these functions were shown to enhance text retrieval (Wilbur and Sirotkin, 1992) and text categorization (Yang and Pedersen, 1997). In this paper we experimented with term strength, a feature scoring function previously shown to be quite competitive in information retrieval. Since term strength is an unsupervised function, we develop two supervised variants of it tailoring them for the classification task.</Paragraph>
  </Section>
class="xml-element"></Paper>