<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1056">
  <Title>Sequential Model Selection for Word Sense Disambiguation *</Title>
  <Section position="3" start_page="0" end_page="388" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In this paper word-sense disambiguation is cast as a problem in supervised learning, where a classifier is induced from a corpus of sense-tagged text. Suppose there is a training sample where each sense-tagged sentence is represented by the feature variables (F1,..., Fn-1, S). Selected contextual properties of the sentence are represented by (F1, * *., Fn-1) and the sense of the ambiguous word is represented by S. Our task is to induce a classifier that will predict the value of S given an untagged sentence represented by the contextual feature variables.</Paragraph>
    <Paragraph position="1"> We adopt a statistical approach whereby a probabilistic model is selected that describes the interactions among the feature variables. Such a model can form the basis of a probabilistic classifier since it specifies the probability of observing any and all combinations of the values of the feature variables.</Paragraph>
    <Paragraph position="2"> Suppose our training sample has N sense-tagged sentences. There are q possible combinations of values for the n feature variables, where each such combination is represented by a feature vector. Let * This research was supported by the Office of Naval Research under grant number N00014-95-1-0776.</Paragraph>
    <Paragraph position="3"> fi and Oi be the frequency and probability of observing the i th feature vector, respectively. Then (fl,..., fq) has a multinomial distribution with parameters (N, 81,..., 8q). The 0 parameters, i.e., the joint parameters, define the joint probability distribution of the feature variables. These are the parameters of the fully saturated model, the model in which the value of each variable directly affects the values of all the other variables. These parameters can be estimated as maximum likelihood estimates (MLEs), such that the estimate of 8i, ~/, is ~.</Paragraph>
    <Paragraph position="4"> For these estimates to be reliable, each of the q possible combinations of feature values must occur in the training sample. This is unlikely for NLP data samples, which are often sparse and highly skewed (c.f., e.g. (Pedersen et al., 1996) and (Zipf, 1935)).</Paragraph>
    <Paragraph position="5"> However, if the data sample can be adequately characterized by a less complex model, i.e., a model in which there are fewer interactions between variables, then more reliable parameter estimates can be obtained: In the case of decomposable models (Darroch et al., 1980; see below), the parameters of a less complex model are parameters of marginal distributions, so the MLEs involve frequencies of combinations of values of only subsets of the variables in the model. How well a model characterizes the training sample is determined by measuring the fit of the model to the sample, i.e., how well the distribution defined by the model matches the distribution observed in the training sample.</Paragraph>
    <Paragraph position="6"> A good strategy for developing probabilistic classifters is to perform an explicit model search to select the model to use in classification. This paper presents the results of a comparative study of search strategies and evaluation criteria for measuring model fit. We restrict the selection process to the class of decomposable models (Darroch et al., 1980), since restricting model search to this class has many computational advantages.</Paragraph>
    <Paragraph position="7"> We begin with a short description of decomposable models (in section 2). Search strategies (in section 3) and model evaluation (in section 4) are described next, followed by the results of an extensive disambiguation experiment involving 12 ambiguous  words (in sections 5 and 6). We discuss related work (in section 7) and close with recommendations for search strategy and evaluation criterion when selecting models for word-sense disambiguation.</Paragraph>
  </Section>
class="xml-element"></Paper>