<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1005">
  <Title>Augmented Mixture Models for Lexical Disambiguation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Problem Formulation. Feature Space
</SectionTitle>
    <Paragraph position="0"> The problem of lexical disambiguation can be modeled as a classification task, in which each instance of the word to be disambiguated (target word, henceforth), identified by its context, has to be labeled with one of the established sense labels a0a2a1 a3a5a4a7a6a5a8a9a4a11a10a12a8a14a13a15a13a15a13a15a8a9a4a11a16a18a17 .1 The approaches we investigate are statistical methods a19a21a20a23a22a25a24 a0a27a26 a28a29 a8a31a30a33a32 , outputting conditional probability distributions over the sense set a0 given a context a34a36a35a37a22 . The classification of a context a34 is generally made by choosing  are represented by the confusion set rather than sense labels (for example a58a60a59a62a61a64a63a66a65a7a67a50a68a70a69a55a63a71a65a12a72a31a68a40a73 ). Association for Computational Linguistics.</Paragraph>
    <Paragraph position="1"> Language Processing (EMNLP), Philadelphia, July 2002, pp. 33-40.</Paragraph>
    <Paragraph position="2"> Proceedings of the Conference on Empirical Methods in Natural ... same table as the others but moved into the other bar with my pint and my ...</Paragraph>
    <Paragraph position="3"> Feature type Word POS Lemma  word bar (inventory of 21 senses) and extracted features tive approach in Section 4.1.</Paragraph>
    <Paragraph position="4"> The contexts a22 are represented as a collection of features. Previous work in WSD and CSSC (Golding, 1995; Bruce et al., 1996; Yarowsky, 1996; Golding and Roth, 1996; Pedersen, 1998) has found diverse feature types to be useful, including inflected words, lemmas and part-of-speech (POS) in a variety of collocational and syntactic relationships, including local bigrams and trigrams, predicate-argument relationships, and wide-context bag-of-words associations. Examples of the feature types we employ are illustrated in Figures 1 and 2.</Paragraph>
    <Paragraph position="5"> The syntactic features are intended to capture the predicate-argument relationships in the syntactic window in which the target word occurs.</Paragraph>
    <Paragraph position="6"> Different relations are considered depending on the target word's POS. For nouns, these relations are: verb-object, subject-verb, modifier-noun, and noun-modified_noun; for verbs: verb-object, verbparticle/preposition, verb-prepositional_object; for adjectives: modifying_adjective-noun. Also, words with the same POS as the target word that are linked to the target word by coordinating conjunctions are extracted as sibling features. The extraction process is performed using simple heuristic patterns and regular expressions over the POS environment.</Paragraph>
    <Paragraph position="7"> As Figure 2 shows, we considered for the CSSC task the POS bigrams of the immediate left and right word pairs as additional features in order to solve POS ambiguity and capture more of the syntactic environment in which the target word occurs (the elements of a confusion set often have disjoint or very different syntactic functions).</Paragraph>
    <Paragraph position="8"> ... presents another {piece,peace} of the problem ...</Paragraph>
    <Paragraph position="10"> {piece,peace} and extracted features</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Mixture Models (MM)
</SectionTitle>
    <Paragraph position="0"> We investigate in this Section a direct statistical model that uses the same starting point as the algorithm presented in Walker (1987). We then compare the functionality and the performance of this model to those of the widely used Naive Bayes model for the WSD task (Gale et al., 1992; Mooney, 1996; Pedersen, 1998), enhanced with the full richer feature space beyond the traditional unordered bag-ofwords. null Algorithm 1 Naive Bayes Model</Paragraph>
    <Paragraph position="2"> It is known that Bayes decision rule is optimal if the distribution of the data of each class is known (Duda and Hart, 1973, ch. 2). However, the class-conditional distributions of the data are not known and have to be estimated. Both Naive Bayes and the mixture model we investigated estimate a75 a53 a4a43a76a34 a56 starting from mathematically correct formulations, and thus would be equivalent if the assumptions they make were correct. Naive Bayes makes the assumption (used to transform Equation (1) into (2)) that the features are conditionally independent given a sense label. The mixture model makes a similar assumption, by regarding a document as being completely described by a union of independent features (Equation (3)). In practice, these are not true. Given the strong correlation and common redundancy of the features in the case of WSD-related tasks, in conjunction with the limited training data on which the probabilities are estimated and the high dimensionality of the feature space, these assumptions lead to substantial modeling problems.</Paragraph>
    <Paragraph position="3"> Another important observation is that very many of the frequencies involved in the probability estimation are zero because of the very sparse feature space. Naive Bayes depends heavily on probabilities not being zero and therefore it has to rely on smoothing. On the other hand, the mixture model is more robust to unseen events, without the need for explicit smoothing.</Paragraph>
    <Paragraph position="4"> Under the proposed mixture model, the conditional probability of a sense a4 given a target word a45 in a context a34 is estimated as a mixture of the conditional sense probability distributions for individual context features: Algorithm 2 Mixture Model</Paragraph>
    <Paragraph position="6"> as opposed to the Naive Bayes model in which the probability of a sense a4 given a context a34 is derived from the prior probability of a4 weighted by the conditional probabilities of the contextual features a98a99a53a55a34 a56 given the sense.</Paragraph>
    <Paragraph position="7"> The probabilities a75 a53</Paragraph>
    <Paragraph position="9"> can be computed as maximum likelihood estimates (MLE), by counting the co-occurrences of a4 and a90 versus the occurrences of a90 , respectively a4 in the training data. An extension to this classical estimation method is to use distance-weighted counts instead of raw counts for the relative frequencies:</Paragraph>
    <Paragraph position="11"> a105 a107 denotes the training contexts of word  a56 is computed by raw count. When a90 is a context word, a100 a53a55a90 a8 a34 a56 is computed as a function of the position a119 of the target word a45 in a34 and the positions a120</Paragraph>
    <Paragraph position="13"> timates are obtained. There are various other ways of choosing the weighting measure a123 . One natural way is to transform the distance a76a119a43a124a126a120 a121 a76 into a closeness measure by considering a123 a53a55a119 a8 a120</Paragraph>
    <Paragraph position="15"> (Manning and Schutze, 1999, ch. 14.1). This measure proves to be effective for the spelling correction task, where the words in the immediate vicinity are far more important than the rest of the context words2, but imposes counterproductive differences between the much wider context positions (such as +30 vs. +31) used in WSD, especially when considering large context windows. Experimental results indicate that it is more effective to level out the local positional differences given by a continuous weighting, by instead using weight-equivalent regions which can be described with a simple stepfunction a123 a53a132a120 a8a50a133a134a56 a1</Paragraph>
    <Paragraph position="17"> A filtering process based on the overall importance of a word a90 for the disambiguation of a45 is also employed, using alterations of the form</Paragraph>
    <Paragraph position="19"> a81a154a151a107 proportional to the number of senses of target word a45 which it co-occurs with in the training set.4 In this way, the words that occur only once in the training set, as well as those that occur with most of the senses of a word, providing no relevant information about the sense itself, are penalized.</Paragraph>
    <Paragraph position="20"> Improvements obtained using weighted frequencies and filtering over MLE are shown in Table 1.  of Bayes and Mixture Model as evaluated by 5-fold cross validation on SENSEVAL-2 English data</Paragraph>
    <Paragraph position="22"> mixture model formula (4). When a90 is a word,</Paragraph>
    <Paragraph position="24"> a56 expresses the positional relationship between the occurrences of a90 and the target word a45 in a34 , and is computed using step-functions as described previously. When a90 is a syntactic headword, a75 a53a55a90 a76a34 a56 is chosen as the average value of two ratios expressing the usefulness of the headword type for the given target word and respectively for the POS-class of the target word (adjective, noun, verb). These ratios are estimated by using a jackknife (hold-one-out) procedure on the training set and counting the number times the headword type is a good predictor versus the number of times it is a bad predictor.</Paragraph>
    <Paragraph position="25">  syntactic, collocational and long-distance context features, the probability estimates used by Naive Bayes and MM and their associated weights (a174 ), and the posterior probabilities of the true sense as computed by the two models.</Paragraph>
    <Paragraph position="26"> As shown in Table 1, Bayes and mixture models yield comparable results for the given task. However, they capture the properties of the feature space in distinct ways (example applications of the two models on the sentence in Figure 1 are illustrated in Figure 3) and therefore, are very appropriate to be used together in combination (see Section 5.4).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Classification Correction and Boosting
</SectionTitle>
    <Paragraph position="0"> We first present an original classification correction method based on the variation of posterior probability estimates across data and then the adaptation of the Adaboost method (Freund and Schapire, 1997) to the task of lexical classification.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The Maximum Variance Correction
Method (MVC)
</SectionTitle>
      <Paragraph position="0"> One problem arising from the sparseness of training data is that mixture models tend to excessively favor the best represented senses in the training set. A probable cause is that spurious words, which can not be considered general stopwords but do not carry sense-disambiguation information for a particular target word, may occur only by chance both in training and test data.6 Another cause is the fact that mixture models search for decision surfaces linear in the feature space7; therefore, they can not make only correct classifications (unless the feature space can be divided by linear conditions) and the samples for the under-represented senses are likely to be interpreted as outliers.</Paragraph>
      <Paragraph position="1"> To address this estimation problem, a second classification step is employed, based on the observation that the deviation of a component of the posterior distribution from its expected value (as computed over the training set) can be as relevant as the maximum of the distribution a42a44a38a46a45a48a47a50a49a52a51a168a187a75 a53 a4a40a76a34 a56 . Instead of classifying each test context independently after estimating its sense probability distribution, we classify it by comparing it with the whole space of training contexts, for which the posterior distributions are computed using a jackknife procedure.</Paragraph>
      <Paragraph position="2"> Figure 4(a) illustrates such an example: each line in the table represents the posterior distribution over senses given a context, each column contains the values corresponding to a particular sense in the posterior distributions of all contexts. Intuitively, sense a4a7a6 may be preferred to the most likely sense</Paragraph>
      <Paragraph position="4"> the a187a75 a53 a4 a6 a76a34 a6a117a189a64a190 a56 is smaller than a187a75 a53 a4a11a188a60a76a34 a6a117a189a64a190 a56 because of the analogy with a34a33a191a57a53 a38a40a39 a133a9a56 and the &amp;quot;expected values&amp;quot; of the components corresponding to a4 a6 and a4a31a188 .</Paragraph>
      <Paragraph position="5"> Unfortunately, we face again the problem of under-representation in the training data: the expected values in the posterior distributions for the under-represented senses when they express the correct classification can not be accurately estimated.</Paragraph>
      <Paragraph position="6"> Therefore, we have to look at the problem from another angle.</Paragraph>
      <Paragraph position="7">  and given another set a203 a1 a53a55a204a52a205 a56 a205 , the elements of a203 that are least probable as being generated from a201 are those for which the variational coefficients</Paragraph>
      <Paragraph position="9"> To apply this assumption to the disambiguation task, a set a105 a47 containing the values a187a75 a53 a4a40a76a34 a56 for all contexts a34 in the training set that are not labeled a4 is built for every sense a4 (see Figure 4(a)). In this way, the problem of poor representation of some senses is overcome and the selections a105 a47 are large for all senses. An instance in the test set is considered more likely to correspond to a sense a4 if the estimated value a187a75 a53</Paragraph>
      <Paragraph position="11"> didate for having its classification changed to a4 .</Paragraph>
      <Paragraph position="12"> Assuming that the selections a105 a47 are representative and there exist first and second order moments for the underlying distributions (conditions which we call &amp;quot;good statistical properties&amp;quot;), an improvement in the accuracy a30 a124a214a213 of the classifier can be expected when choosing a sense with a variational coefficient a206 a34a196a215</Paragraph>
      <Paragraph position="14"> a sense exists). For example, knowing that the performance of the mixture model for SENSEVAL-2 is 8It is hard to judge how well estimated these statistics are without making any distributional assumptions.</Paragraph>
      <Paragraph position="15"> approximatively a29 a13a218a217a52a219 , the threshold for variational coefficients is set to a30a7a13a218a217a52a220 . Because spurious words not only favor the better represented senses in the training set, but also can affect the variational coefficients of unlikely senses, some restrictions had to be imposed in our implementation to avoid the other extreme of favoring unlikely senses.</Paragraph>
      <Paragraph position="16"> The mixture model does not guarantee the requirements imposed by the MVC method are met, but it has the advantage over the Bayesian model that each of the components of the posterior distribution it computes can be seen as a weighted mixture of random variables corresponding to the individual features. In the simplest case, when considering binary features, these variables are Bernoulli trials. Furthermore, if the trials have the same probability-mass function then a component of the posterior distribution will follow a binomial distribution, and therefore would have good statistical properties. In general, the underlying distributions can not be computed, but our experiments show that they usually have good statistical properties as required by MVC.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 AdaBoost
</SectionTitle>
      <Paragraph position="0"> AdaBoost is an iterative boosting algorithm introduced by Freund and Schapire (1997) shown to be successful for several natural language classification tasks. AdaBoost successively builds classifiers based on a weak learner (base learning algorithm) by weighting differently the examples in the training space, and outputs the final classification by mixing the predictions of the iteratively built classifiers. Because sense disambiguation is a multi-class problem, we chose to use version AdaBoost.M2.</Paragraph>
      <Paragraph position="1"> We could not apply AdaBoost straightforwardly to the problem of sense disambiguation because of the high dimensionality and sparseness of the feature space. Superficial modeling of the training set can easily be achieved because of the singularity/rarity of many feature values in the context space, but this largely represents overfitting of the training data. In order to solve this problem, we use AdaBoost in conjunction with jackknife and a partial updating technique. At each round, a221 classifiers are built using as training all the examples in the training set except the one to be classified, and the weights are updated at feature level rather than context level. This modified Adaboost algorithm could only be implemented for the mixture model, which &amp;quot;perceives&amp;quot; the contexts as additive mixture of features. The Adaboost-enhanced mixture model is called AdaMixt henceforth.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We present a comparative study for four languages (English, Swedish, Spanish, and Basque) by performing 5-fold cross-validation on the SENSEVAL-2 lexical-sample training data, using the fine-grained sense inventory. For English and Swedish, for which POS-tagged training data was available to us, the fnTBL algorithm (Ngai and Florian, 2001) based on Brill (1995) was used to annotate the data, while for Spanish a mildly-supervised POS-tagging system similar to the one presented in Cucerzan and Yarowsky (2000) was employed. We also present the results obtained by the different algorithms on another WSD standard set, SENSEVAL-1, also by performing 5-fold cross validation on the original training data. For CSSC, we tested our system on the identical data from the Brown corpus used by Golding (1995), Golding and Roth (1996) and Mangu and Brill (1997). Finally, we present the results obtained by the investigated methods on a single run on the Senseval-1 and Senseval-2 test data.</Paragraph>
    <Paragraph position="1"> The described models were initially trained and tested by performing 5-fold cross-validation on the SENSEVAL-2 English lexical-sample-task training data. When parameters needed to be estimated, jackknife or a 3-1-1 split (training and/or parameter estimation - testing) were used.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 SENSEVAL-2
</SectionTitle>
      <Paragraph position="0"> The English training set for SENSEVAL-2 is composed of 8861 instances representing 73 target words with an average number of 12.5 senses per word. Table 2 illustrates the performance of each of the studied models broken down by part-ofspeech. As observed in most experiments, the feature-enhanced Naive Bayes has the tendency to outperform by a small margin the raw mixture model, but because the latter proved to be boostingfriendly, its augmented versions achieved the highest final accuracies. The difference between MMVC and enhanced Naive Bayes is significant (McNemar rejection risk of a222a126a24 a30 a29</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Table 2: Performance of the studied models, by part of speech, on the SENSEVAL-2 English lexical-sample training data
</SectionTitle>
    <Paragraph position="0"> Figure 5 shows both the performance of the mixture model alone and in conjunction with MVC, and highlights the improvement in performance achieved by the latter for each of the 4 languages.</Paragraph>
    <Paragraph position="1"> All MMVC versus MM differences are statistically significant (for SENSEVAL-2 English data, the rejection probability of a paired McNemar test is a30 a29  fold cross validation on SENSEVAL-2 data for 4 languages Figure 6 shows what is generally a log-linear increase in performance of MM alone and in combination with the MVC method over increasing training sizes. Because of the way the smallest training sets were created to include at least one example for each sense, they were more balanced as a side effect, and the compensations introduced by MVC were less productive as a result. Given more training data, MMVC starts to improve relative to the raw model both because the training sets become more unbalanced in their sense distributions and because the empirical moments and the variational coefficients on which the method relies are better estimated.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 SENSEVAL-1
</SectionTitle>
      <Paragraph position="0"> The systems used for SENSEVAL-2 English data were also evaluated on the SENSEVAL-1 training  data (30 words, 12479 instances, with an average of 10.8 senses per word) by using 5-fold cross validation. There was no further tuning of the feature space or model parameters to adapt them to the particularities of this new test set. Comparative performance is shown in Table 3. The difference between</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Spelling Correction
</SectionTitle>
      <Paragraph position="0"> Both MM and the enhanced Bayes model obtain virtually the same overall performance9 as the TriBayes system reported in (Golding and Schabes, 1996), which uses a similar feature space. The correction and boosting methods we investigated marginally improve the performance of the mixture model, as can be seen in Table 4 but they do not achieve the performance of RuleS 93.1% (Mangu and Brill, 1997) and Winnow 93.5% (Golding and Roth, 1996; Golding and Roth, 1999), methods that include features more directly specialized for spelling correction. Because of the small size of the test set, the differences in performance are due to only 14 and 20 more incorrectly classified examples respectively. More important than this difference10 may be the fact that the systems built for WSD were able to achieve competitive performance 9All figures reported are for the standard 14 confusion sets; the accuracies for the 18 sets are generally higher.</Paragraph>
      <Paragraph position="1"> 10We did not have the actual classifications from the other systems to check the significance of the difference.</Paragraph>
      <Paragraph position="2"> with little to no adaptation (we only enriched the feature space by adding the POS bigrams to the left and right of the target word and changed the weighting model as presented in Section 3 because spelling correction relies more on the immediate than long-distance context). Another important aspect that can testsize M.L. Bayes MM AdaMixt MMVC  be seen in Table 4 is that there was no model that constantly performed best in all situations, suggesting the advantage of developing a diverse space of models for classifier combination.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Using MMVC in Classifier Combination
</SectionTitle>
      <Paragraph position="0"> The investigated MMVC model proves to be a very effective participant in classifier combination, with substantially different output to Naive Bayes (9.6% averaged complementary rate, as defined in Brill and Wu (1998)). Table 5 shows the improvement obtained by adding the MMVC model to empirically the best voting system we had using Bayes, BayesRatio, TBL and Decision Lists (all classifier combination methods tried and their results are presented exhaustively in Florian and Yarowsky (2002)). The improvement is significant in both cases, as measured by a paired McNemar test: a30a7a13a87a245 a24 a30 a29  fier combination on SENSEVAL-1 and SENSEVAL-2 English as computed by 5-fold cross validation over training data MMVC is also the top performer of the 5 systems mentioned above on SENSEVAL-2 English test data, with an accuracy of 62.5%. Table 6 contrasts the performance obtained by the MMVC method to the average and best system performance in the two  glish test data (only the supervised systems with a coverage of at least 97% were used to compute the mean and variance)</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML