<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1056">
  <Title>Sequential Model Selection for Word Sense Disambiguation *</Title>
  <Section position="4" start_page="388" end_page="388" type="metho">
    <SectionTitle>
2 Decomposable Models
</SectionTitle>
    <Paragraph position="0"> Decomposable models are a subset of the class of graphical models (Whittaker, 1990) which are in turn a subset of the class of log-linear models (Bishop et al., 1975). Familiar examples of decomposable models are Naive Bayes and n-gram models.</Paragraph>
    <Paragraph position="1"> They are characterized by the following properties (Bruce and Wiebe, 1994b): 1. In a graphical model, variables are either inter-dependent or conditionally independent of one another. 1 All graphical models have a graphical representation such that each variable in the model is mapped to a node in the graph, and there is an undirected edge between each pair of nodes corresponding to interdependent variables. The sets of completely connected nodes (i.e., cliques) correspond to sets of interdependent variables. Any two nodes that are not directly connected by an edge are conditionally independent given the values of the nodes on the path that connects them.</Paragraph>
    <Paragraph position="2"> 2. Decomposable models are those graphical models that express the joint distribution as the product of the marginal distributions of the variables in the maximal cliques of the graphical representation, scaled by the marginal distributions of variables common to two or more of these maximal sets. Because their joint distributions have such closed-form expressions, the parameters can be estimated directly from the training data without the need for an iterative fitting procedure (as is required, for example, to estimate the parameters of maximum entropy models; (Berger et al., 1996)).</Paragraph>
    <Paragraph position="3"> 3. Although there are far fewer decomposable models than log-linear models for a given set of feature variables, it has been shown that they have substantially the same expressive power (Whittaker, 1990).</Paragraph>
    <Paragraph position="4"> The joint parameter estimate &amp;quot;d Fl'F2'F3's \]~,\]~..f3,~, is the probability that the feature vector (fl, f~., .1:3, si) will be observed in a training sample where each observation is represented by the feature variables (F1, F~, F3, S). Suppose that the graphical representation of a decomposable model is defined by the two cliques (i.e., marginals) (F1, S) and (F2, F3, S). The frequencies of these marginals, f(F1 = fl, S = si) and f(F2 = f2,F3 = f3,S = si), are sufficient statistics in that they provide enough information</Paragraph>
    <Paragraph position="6"> to calculate maximum likelihood estimates of the model parameters. MLEs of the model parameters are simply the marginal frequencies normalized by the sample size N. The joint parameter estimate is formulated as a normalized product:</Paragraph>
    <Paragraph position="8"> Rather than having to observe the complete feature vector (ft, f.~, f3, si) in the training sample to estimate the joint parameter, it is only necessary to observe the marginals (ft, si) and (f2, f3, si).</Paragraph>
  </Section>
  <Section position="5" start_page="388" end_page="389" type="metho">
    <SectionTitle>
3 Model Search Strategies
</SectionTitle>
    <Paragraph position="0"> The search strategies presented in this paper are backward sequential search (BSS) and forward sequential search (FSS). Sequential searches evaluate models of increasing (FSS) or decreasing (BSS) levels of complexity, where complexity is defined by the number of interactions among the feature variables (i.e., the number of edges in the graphical representation of the model).</Paragraph>
    <Paragraph position="1"> A backward sequential search (BSS) begins by designating the saturated model as the current model. A saturated model has complexity level i = n(n-t) where n is the number of feature vari- 2 ables. At each stage in BSS we generate the set of decomposable models of complexity level i - 1 that can be created by removing an edge from the current model of complexity level i. Each member of this set is a hypothesized model and is judged by the evaluation criterion to determine which model results in the least degradation in fit from the current model--that model becomes the current model and the search continues. The search stops when either (1) every hypothesized model results in an unacceptably high degradation in fit or (2) the current model has a complexity level of zero.</Paragraph>
    <Paragraph position="2"> A forward sequential search (FSS) begins by designating the model of independence as the current model. The model of independence has complexity level i = 0 since there are no interactions among the feature variables. At each stage in FSS we generate the set of decomposable models of complexity level i + 1 that can be created by adding an edge to the current model of complexity level i. Each member of this set is a hypothesized model and is judged by the evaluation criterion to determine which model results in the greatest improvement in fit from the current model--that model becomes the current model and the search continues. The search stops when either (1) every hypothesized model results in an unacceptably small increase in fit or (2) the current model is saturated.</Paragraph>
    <Paragraph position="3"> For sparse samples FSS is a natural choice since early in the search the models are of low complexity.</Paragraph>
    <Paragraph position="4">  The number of model parameters is small and they have more reliable estimated values. On the other hand, BSS begins with a saturated model whose parameter estimates are known to be unreliable.</Paragraph>
    <Paragraph position="5"> During both BSS and FSS, model selection also performs feature selection. If a model is selected where there is no edge connecting a feature variable to the classification variable then that feature is not relevant to the classification being performed.</Paragraph>
  </Section>
  <Section position="6" start_page="389" end_page="389" type="metho">
    <SectionTitle>
4 Model Evaluation Criteria
</SectionTitle>
    <Paragraph position="0"> Evaluation criteria fall into two broad classes, significance tests and information criteria. This paper considers two significance tests, the exact conditional test (Kreiner, 1987) and the Log-likelihood ratio statistic G 2 (Bishop et al., 1975), and two information criteria, Akaike's Information Criterion (AIC) (Akaike, 1974) and the Bayesian Information Criterion (BIC) (Schwarz, 1978).</Paragraph>
    <Section position="1" start_page="389" end_page="389" type="sub_section">
      <SectionTitle>
4.1 Significance tests
</SectionTitle>
      <Paragraph position="0"> The Log-likelihood ratio statistic G 2 is defined as:</Paragraph>
      <Paragraph position="2"> where fi and ei are the observed and expected counts of the i th feature vector, respectively. The observed count fi is simply the frequency in the training sample. The expected count ei is calculated from the frequencies in the training data assuming that the hypothesized model, i.e., the model generated in the search, adequately fits the sample. The smaller the value of G 2 the better the fit of the hypothesized model.</Paragraph>
      <Paragraph position="3"> The distribution of G 2 is asymptotically approximated by the X 2 distribution (G 2 ,,~ X 2) with adjusted degrees of freedom (dof) equal to the number of model parameters that have non-zero estimates given the training sample. The significance of a model is equal to the probability of observing its reference G ~ in the X 2 distribution with appropriate dof. A hypothesized model is accepted if the significance (i.e., probability) of its reference G ~ value is greater than, in the case of FSS, or less than, in the case of BSS, some pre-determined cutoff, a.</Paragraph>
      <Paragraph position="4"> An alternative to using a X 2 approximation is to define the exact conditional distribution of G 2. The exact conditional distribution of G 2 is the distribu: tion of G ~ values that would be observed for comparable data samples randomly generated from the model being tested. The significance of G 2 based on the exact conditional distribution does not rely on an asymptotic approximation and is accurate for sparse and skewed data samples (Pedersen et al., 1996)</Paragraph>
    </Section>
    <Section position="2" start_page="389" end_page="389" type="sub_section">
      <SectionTitle>
4.2 Information criteria
</SectionTitle>
      <Paragraph position="0"> The family of model evaluation criteria known as information criteria have the following expression:</Paragraph>
      <Paragraph position="2"> where G ~ and dof are defined above. Members of this family are distinguished by their different values of ~. AIC corresponds to g = 2. BIC corresponds to ~ = log(N), where N is the sample size.</Paragraph>
      <Paragraph position="3"> The various information criteria are an alternative to using a pre-defined significance level (a) to judge the acceptability of a model. AIC and BIC reward good model fit and penalize models with large numbers of parameters. The parameter penalty is expressed as ~ x do f, where the size of the penalty is the adjusted degrees of freedom, and the weight of the penalty is controlled by x.</Paragraph>
      <Paragraph position="4"> During BSS the hypothesized model with the largest negative IC,~ value is selected as the current model of complexity level i - 1, while during FSS the hypothesized model with the largest positive IC,~ value is selected as the current model of complexity level i + 1. The search stops when the IC,~ values for all hypothesized models are greater than zero in the case of BSS, or less than zero in the case of FSS.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="389" end_page="390" type="metho">
    <SectionTitle>
5 Experimental Data
</SectionTitle>
    <Paragraph position="0"> The sense-tagged text and feature set used in these experiments are the same as in (Bruce et al., 1996). The text consists of every sentence from the ACL/DCI Wall Street Journal corpus that contains any of the nouns interest, bill, concern, and drug, any of the verbs close, help, agree, and include, or any of the adjectives chief, public, last, and common.</Paragraph>
    <Paragraph position="1"> The extracted sentences have been hand-tagged with senses defined in the Longman Dictionary of Contemporary English (LDOCE). There are between 800 and 3,000 sense-tagged sentences for each of the 12 words. This data was randomly divided into training and test samples at a 10:1 ratio.</Paragraph>
    <Paragraph position="2"> A sentence with an ambiguous word is represented by a feature set with three types of contextual feature variables: 2 (1) The morphological feature (E) indicates if an ambiguous noun is plural or not. For verbs it indicates the tense of the verb. This feature is not used for adjectives. (2) The POS features have one of 25 possible POS tags, derived from the first letter of the tags in the ACL/DCI WSJ corpus. There are four POS feature variables representing the POS of the two words immediately preceding (L1, L2) and following (R1, R2) the ambiguous word. (3) The three binary collocation-specific features (C1, C2, Ca) indicate ifa particular word occurs in a sentence with an ambiguous word.</Paragraph>
    <Paragraph position="3">  The sparse nature of our data can be illustrated by interest. There are 6 possible values for the sense variable. Combined with the other feature variables this results in 37,500,000 possible feature vectors (or joint parameters). However, we have a training sample of only 2,100 instances.</Paragraph>
  </Section>
  <Section position="8" start_page="390" end_page="392" type="metho">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In total, eight different decomposable models were selected via a model search for each of the 12 words.</Paragraph>
    <Paragraph position="1"> Each of the eight models is due to a different combination of search strategy and evaluation criterion.</Paragraph>
    <Paragraph position="2"> Two additional classifiers were evaluated to serve as benchmarks. The default classifier assigns every instance of an ambiguous word with its most frequent sense in the training sample. The Naive Bayes classifier uses a model that assumes that each contextual feature variable is conditionally independent of all other contextual variables given the value of the sense variable.</Paragraph>
    <Section position="1" start_page="390" end_page="390" type="sub_section">
      <SectionTitle>
6.1 Accuracy comparison
</SectionTitle>
      <Paragraph position="0"> The accuracy 3 of each of these classifiers for each of the 12 words is shown in Figure 1. The highest accuracy for each word is in bold type while any accuracies less than the default classifier are italicized. The complexity of the model selected is shown in parenthesis. For convenience, we refer to model selection using, for example, a search strategy of FSS and the evaluation criterion AIC as FSS AIC.</Paragraph>
      <Paragraph position="1"> Overall AIC selects the most accurate models during both BSS and FSS. BSS AIC finds the most accurate model for 6 of 12 words while FSS AIC finds the most accurate for 4 of 12 words. BSS BIC and the Naive Bayes find the most accurate model for 3 of 12 words. Each of the other combinations finds the most most accurate model for 2 of 12 words except for FSS exact conditional which never finds the most accurate model.</Paragraph>
      <Paragraph position="2"> Neither AIC nor BIC ever selects a model that results in accuracy less than the default classifier.</Paragraph>
      <Paragraph position="3"> However, FSS exact conditional has accuracy less than the default for 6 of 12 words and BSS exact conditional has accuracy less than the default for 3 of 12 words. BSS G 2-~X 2 and FSS G 2,,~ X 2 have less than default accuracy for 2 of 12 and 1 of 12 words, respectively.</Paragraph>
      <Paragraph position="4"> The accuracy of the significance tests vary greatly depending on the choice of c~. Of the various (~ values that were tested, .01, .05, .001, and .0001, the value of .0001 was found to produce the most accurate models. Other values of c~ will certainly led to other results. The information criteria do not require the setting of any such cut-off values.</Paragraph>
      <Paragraph position="5"> A low complexity model that results in high accuracy disambiguation is the ultimate goal. Figure 1 3The percentage of ambiguous words in a held out test sample that are disambiguated correctly.</Paragraph>
      <Paragraph position="6"> shows that BIC and G 2 ,-~ X 2 select lower complexity models than either AIC or the exact conditional test.</Paragraph>
      <Paragraph position="7"> However, both appear to sacrifice accuracy when compared to AIC. BIC assesses a greater parameter penalty (~ = log(N)) than does AIC (~ = 2), causing BSS BIC to remove more interactions than BSS AIC. Likewise, FSS BIC adds fewer interactions than FSS AIC. In both cases BIC selects models whose complexity is too low and adversely affects accuracy when compared to AIC.</Paragraph>
      <Paragraph position="8"> The Naive Bayes classifier achieves a high level of accuracy using a model of low complexity. In fact, while the Naive Bayes classifier is most accurate for only 3 of the 12 words, the average accuracy of the Naive Bayes classifiers for all 12 words is higher than the average classification accuracy resulting from any combination of the search strategies and evaluation criteria. The average complexity of the Naive Bayes models is also lower than the average complexity of the models resulting from any combination of the search strategies and evaluation criteria except BSS BIC and FSS BIC.</Paragraph>
    </Section>
    <Section position="2" start_page="390" end_page="390" type="sub_section">
      <SectionTitle>
6.2 Search strategy and accuracy
</SectionTitle>
      <Paragraph position="0"> An evaluation criterion that finds models of similar accuracy using either BSS or FSS is to be preferred over one that does not. Overall the information criteria are not greatly affected by a change in the search strategy, as illustrated in Figure 3.</Paragraph>
      <Paragraph position="1"> Each point on this plot represents the accuracy of the models selected for a word by the same evaluation criterion using BSS and FSS. If this point falls close to the line BSS = FSS then there is little or no difference between the accuracy of the models selected during FSS and BSS.</Paragraph>
      <Paragraph position="2"> AIC exhibits only minor deviation from BSS = FSS. This is also illustrated by the fact that the average accuracy between BSS AIC and FSS AIC only differs by .0013. The significance tests, especially the exact conditional, are more affected by the search strategy. It is clear that BSS exact conditional is much more accurate than FSS exact conditional. FSS G 2 -~ X 2 is slightly more accurate than BSS G 2 ,-, X ~.</Paragraph>
    </Section>
    <Section position="3" start_page="390" end_page="392" type="sub_section">
      <SectionTitle>
6.3 Feature selection: interest
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the models selected by the various combinations of search strategy and evaluation criterion for interest.</Paragraph>
      <Paragraph position="1"> During BSS, AIC removed feature L2 from the model, BIC removed L1,L2, R1 and R2, G 2 &amp;quot;-, X 2 removed no features, and the exact conditional test removed C2. During FSS, AIC never added R2, BIC never added C1, C3, L1, L~ and R~, and G ~ ~, X 2 and the exact conditional test added all the features.</Paragraph>
      <Paragraph position="2"> G 2 ~ X 2 is the most consistent of the evaluation criteria in feature selection. During both BSS and FSS it found that all the features were relevant to classification.</Paragraph>
      <Paragraph position="4"> AIC found seven features to be relevant in both BSS and FSS. When using AIC, the only difference in the feature set selected during FSS as compared to that selected during BSS is the part of speech feature that is found to be irrelevant: during BSS L2 is removed and during FSS R2 is never added. All other criteria exhibit more variation between FSS and BSS in feature set selection.</Paragraph>
    </Section>
    <Section position="4" start_page="392" end_page="392" type="sub_section">
      <SectionTitle>
6.4 Model selection: interest
</SectionTitle>
      <Paragraph position="0"> Here we consider the results of each stage of the sequential model selection for interest. Figures 4 through 7 show the accuracy and recall 4 for the best fitting model at each level of complexity in the search. The rightmost point on each plot for each evaluation criterion is the measure associated with the model ultimately selected.</Paragraph>
      <Paragraph position="1"> These plots illustrate that BSS BIC selects models of too low complexity. In Figure 4 BSS BIC has &amp;quot;gone past&amp;quot; much more accurate models than the one it selected. We observe the related problem for FSS BIC. In Figure 6 FSS BIC adds too few interactions and does not select as accurate a model as FSS AIC. The exact conditional test suffers from the reverse problem of BIC. BSS exact conditional removes only a few interactions while FSS exact conditional adds many interactions, and in both cases the resulting models have poor accuracy.</Paragraph>
      <Paragraph position="2"> The difference between BSS and FSS is clearly il4The percentage of ambiguous words in a held out test sample that are disambiguated, correctly or not. A word is not disambiguated if the model parameters needed to assign a sense tag cannot be estimated from the training sample.</Paragraph>
      <Paragraph position="3"> lustrated by these plots. AIC and BIC eliminate interactions that have high dof's (and thus have large numbers of parameters) much earlier in BSS than the significance tests. This rapid reduction in the number of parameters results in a rapid increases in accuracy (Figure 4) and recall for AIC and BIC (Figure 5) relative to the significance tests as they produce models with smaller numbers of parameters that can be estimated more reliably.</Paragraph>
      <Paragraph position="4"> However, during the early stages of FSS the number of parameters in the models is very small and the differences between the information criteria and the significance tests are minimized. The major difference among the criteria in Figures 6 and 7 is that the exact conditional test adds many more interactions.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML