<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2405">
  <Title>Co-training and Self-training for Word Sense Disambiguation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Supervised Word Sense Disambiguation
</SectionTitle>
    <Paragraph position="0"> Supervised word sense disambiguation systems work under the assumption that several annotated examples are available for a target ambiguous word. These examples are used to build a classifier that automatically learns clues useful for the disambiguation of the given polysemous word, and then applies these clues to the classification of new unlabeled instances.</Paragraph>
    <Paragraph position="1"> First, the examples are pre-processed and annotated with morphological or syntactic tags. Next, each sense-tagged example is transformed into a feature vector, suitable for an automatic learning process. There are two main decisions that one takes in the construction of such a classifier: (1) What features to extract from the examples provided, to best model the behavior of the given ambiguous word; (2) What learning algorithm to use for best performance.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Preprocessing
</SectionTitle>
      <Paragraph position="0"> During preprocessing, SGML tags are eliminated, the text is tokenized and annotated with parts of speech. Collocations are identified using a sliding window approach, where a collocation is considered to be a sequence of words that forms a compound concept defined in Word-Net. During this process, all collocations that include the target word are identified, and the examples that use a collocation are removed from the training/test data. For instance, examples referring to short circuit are removed from the data set for circuit, so that a separate learning process is performed for each lexical unit.</Paragraph>
      <Paragraph position="1">  determined for each sense of the ambiguous word. The value of this feature is either 0 or 1, depending if the current example contains one of the determined keywords or not.</Paragraph>
      <Paragraph position="2"> B (T) Maximum of M bigrams occurring at least N times are determined for all training examples. The value of this feature is either 0 or 1, depending if the current example contains one of the determined bigrams or not. Bigrams are ordered using the Dice coefficient</Paragraph>
      <Paragraph position="4"> biguation. a3a5a4 denotes the current (ambiguous) word.</Paragraph>
      <Paragraph position="5"> Feature type is indicated as local (L) or topical (T).</Paragraph>
      <Paragraph position="6"> 3.2 Features that are good indicators of word sense Previous work on word sense disambiguation has acknowledged several local and topical features as good indicators of word sense. These include surrounding words and their part of speech tags, collocations, keywords in contexts. More recently, other possible features have been investigated: bigrams, named entities, syntactic features, semantic relations with other words in context. Table 1 lists commonly used features in word sense disambiguation (list drawn from a larger set of features compiled by (Mihalcea, 2002)).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Supervised learning for word sense disambiguation
</SectionTitle>
      <Paragraph position="0"> disambiguation Related work in supervised word sense disambiguations includes experiments with a variety of learning algorithms, with varying degrees of success, including Bayesian learning, decision trees, decision lists, memory based learning, and others. (Yarowsky and Florian, 2002) give a comprehensive examination of learning methods and their combination.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Basic Classifiers for Word Sense Disambiguation
</SectionTitle>
      <Paragraph position="0"> Several basic word sense disambiguation classifiers can be implemented using feature combinations from Table 1, and feature vectors can be plugged into any learning algorithm. We use Naive Bayes, since it was previously shown that in combination with the features we consider, can lead to a state-of-the-art disambiguation system (Lee and Ng, 2002). Moreover, Naive Bayes is particularly suitable for co-training and self-training, since it provides confidence scores and is efficient in terms of training and testing time.</Paragraph>
      <Paragraph position="1"> The two separate views required for co-training are defined using a local versus topical feature split. For selftraining, a global classifier with no feature split is defined. A local classifier A local classifier was implemented using all local features listed in Table 1.</Paragraph>
      <Paragraph position="2"> A topical classifier The topical classifier relies on features extracted from a large context, in particular keywords specific to each individual sense. We use the SK feature, and extract at most ten keywords for each word sense, each occurring for at least three times in the annotated corpus.</Paragraph>
      <Paragraph position="3"> A global classifier Finally, the global classifier integrates all local and topical features, also in a Naive Bayes classifier. This classifier is basically a combination of the previous two local and topical classifiers.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="35" type="metho">
    <SectionTitle>
4 Co-training and Self-training for Word Sense Disambiguation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <Paragraph position="0"> We investigate the application of co-training and self-training to the problem of supervised word sense disambiguation, and explore methods for selecting values for the bootstrapping parameters.</Paragraph>
      <Paragraph position="1"> The data set used in this study consists in training and test data made available during the English lexical sample task in the SENSEVAL-2 evaluation exercise. In addition to these data sets, a large raw corpus of unlabeled examples is constructed for each word, with text snippets consisting of three consecutive sentences extracted from the British National Corpus. Given the large number of runs performed for each word, the experiments focus on nouns only. Similar observations are however expected to hold for other parts of speech.</Paragraph>
      <Paragraph position="2"> For co-training, we use the local and topical classifiers described in Section 3.4, which represent two different views for this problem, generated by a &amp;quot;local versus topical&amp;quot; feature split. Self-training requires only one basic classifier, and we use a global classifier, which combines the features from both local and topical views for a complete global &amp;quot;view&amp;quot;.</Paragraph>
      <Paragraph position="3"> Unlike previous applications of co-training and self-training to natural language learning, where one general classifier is build to cover the entire problem space, supervised word sense disambiguation implies a different classifier for each individual word, resulting eventually in thousands of different classifiers, each with its own characteristics (learning rate, sensitivity to new examples, etc.). Given this heterogeneous space of classifiers, our hypothesis is that co-training and self-training will themselves have a heterogeneous behavior, and therefore best co-training and self-training parameters are different for each classifier.</Paragraph>
      <Paragraph position="4"> To explore this hypothesis, a range of experiments is performed. First, for all the words in the experimental data set, an optimal parameter setting is determined. This can be considered as an upper bound for improvements achieved with co-training and self-training, since the selection of parameters is performed through measurements that are collected directly on test data. Second, we explore several algorithms to select the bootstrapping parameters, independent of the test set: (1) Best overall parameter setting; (2) Best individual parameter settings; (3) Best per-word parameter selection; (4) A new method consisting in an improved bootstrapping scheme using majority voting across several bootstrapping iterations.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Optimal settings
</SectionTitle>
      <Paragraph position="0"> Optimal parameter settings are determined through measurements performed directly on the test set. For the growth size G, a value is chosen from the set: a0 1, 10, 20, 30, 40, 50, 100, 150, 200 a1 . The pool size P takes one of these values: a0 1, 100, 500, 1000, 1500, 2000, 5000 a1 . For each setting, 40 iterations are performed. This results in an average of 2,120 classification runs per word. At each run, a pool of P raw examples is annotated, and G most confidently labeled examples are added to the training set from the previous iteration. The performance of the classifier using the augmented training set is evaluated on the test data, and the precision is recorded.</Paragraph>
      <Paragraph position="1"> Separate experiments are performed for both co-training and self-training, for all the nouns in the SENSEVAL-2 data set (for a total of about 120,000 runs).</Paragraph>
      <Paragraph position="2"> For each word, the parameter setting (growth size G / pool size P / iterations I) leading to the highest improvement is determined. Table 21 lists, for each word: size of training, test, raw data2; precision of the basic classifier (the global classifier is used as a baseline); maximum precision obtained with co-training and self-training, and the parameters for which this maximum is achieved. When several parameter settings lead to the same performance, the first setting is recorded.</Paragraph>
      <Paragraph position="3">  Surprisingly, under optimal settings, both co-training and self-training perform about the same, leading to an average error reduction of 25.5%. Self-training leads to the highest precision for nine words, while co-training is winning for eight words; there is a tie with equal performance for both co-training and self-training for the remaining twelve words.</Paragraph>
      <Paragraph position="4"> There are three words (chair, holiday, spade) for which no improvement could be obtained with either co-training or self-training, and therefore no optimal setting is indicated. These are among the four words with the best performing basic classifier (baseline higher than 75%). The fact that no improvement was obtained agrees with previous observations that classifiers that are too accurate cannot be improved with bootstrapping (Pierce and Cardie, 2001). Note that even very weak classifiers, with precisions below 40%, can still be improved, sometimes with as much as 45% error reduction (e.g. the classifier for feeling).</Paragraph>
      <Paragraph position="5"> There are no clear commonalities between the parameters leading to maximum precision for different classifiers. Some classifiers benefit more from an &amp;quot;aggressive&amp;quot; augmentation of the training data with new examples for instance the self-trained classifier for nature achieves its highest peak for a growth size of 200 from a pool of  basic classifier precision refer to data sets obtained after removing examples with collocations that include the target word. This explains why the numbers do not always match figures previously reported in SENSEVAL-2 literature. If collocations are added back to the data sets, the precision of the basic classifier is measured at 60.2% - comparable to figures obtained by other systems participating in SENSEVAL-2 2The raw corpus for each word is formed with all examples retrieved from the British National Corpus. While this ensures a natural distribution for each word (in terms of number of examples occurring in a balanced corpus), it also leads to discrepancies in terms of raw data size. For words with less than 5000 raw examples, the pool size recorded in the &amp;quot;optimal setting&amp;quot;' column represents a round-up to the nearest number from the set of allowed pool values.</Paragraph>
      <Paragraph position="6">  parameter settings for self-training and co-training. The optimal settings column lists the values for the three parameters (growth size G / pool size P / iteration I) for which the maximum precision was observed. The improvements obtained under these optimal settings can be considered as an upper bound for self-training and co-training. Under some ideal conditions, where the optimal parameters can be identified, this is the highest improvement that can be achieved for the given labeled set. However, most of the times, it may not be possible to find these optimal parameter values. In the following, we explore empirical solutions for finding values for these parameters, independent of the test data.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="35" type="sub_section">
      <SectionTitle>
4.2 Empirical Settings
</SectionTitle>
      <Paragraph position="0"> The methods described in this section make use of the data collected during self-training and co-training runs, for different parameter settings. Evaluations are performed on a validation set consisting of about 20% of the training data - which was set apart for this purpose. For each run, information is collected about: initial size of labeled data set, growth size, pool size, iteration number, precision of basic classifier, precision of boosted classifier. With the range of values for growth size, pool size, and number of iterations specified in Section 4.1, about 60,000 such records are collected for both co-training and self-training.</Paragraph>
      <Paragraph position="1">  One simple method to select a parameter setting is to determine a global setting that leads to the highest overall boost in precision. Starting with the information collected for the 60,000 runs, for each possible parameter setting, the total relative growth in performance is determined, by adding up the relative improvements for all the runs for that particular setting. For co-training, the best global setting identified in this way is growth size of 50, pool size of 5000, iteration 2. For self-training, the best setting is growth size of 1, pool size of 1500, iteration 2. The precision obtained for these settings is listed in Table 3 under the global settings column. On average, using this scheme for parameter selection, co-training brings 4% error reduction, while self-training has only a small error reduction of 1%.</Paragraph>
      <Paragraph position="2"> In a similar approach, the value for each parameter is determined independent of the other parameters. Instead of selecting the best value for all parameters at once, values are selected individually. Again, for each possible parameter value, the total relative growth in performance is determined, and the value leading to the highest growth is selected. Interestingly, for both co-training and selftraining, the best values identified in this way for the three  parameters are growth size of 1, pool size of 1, iteration 1 (i.e. the best classifier is the one &amp;quot;closest&amp;quot; to the basic classifier). The average results are however worse than the baseline - only 53.49% for co-training, and 53.67% for self-training.</Paragraph>
      <Paragraph position="3">  In a second experiment, best parameter settings are identified separately for each word. The setting yielding to maximum precision on the validation set is selected as the best setting for a given word, and evaluated on the test data. If multiple settings are identified as leading to maximum precision, settings are prioritized based on (in this order): smallest growth size; largest pool size; number of iterations. Results for both co-training and self-training are listed in Table 3 under the per-word settings column, together with the setting identified as optimal on the validation set. There are several words for which significant improvement is observed over the baseline. However, on the average, the performance of the boosted classifiers is worse than the baseline.</Paragraph>
      <Paragraph position="4">  majority voting There is a common trend observed for learning curves for co-training or self-training, consisting in an increase in performance followed by a decline. Different classifiers exhibit however a different point of raise or decline in precision, depending on the number of iterations. For instance, the classifier for circuit achieves its highest peak at iteration 10 (see Table 2), while the classifier for nation has the highest boost at iteration 21 - where the performance for circuit is already below the baseline. Given this heterogeneous behavior, it is difficult to identify a point of maximum for each classifier, or at least a point where the performance is not below the baseline. Ideally, we would like the learning curves to have a more stable behavior without sharp raises or drops in precision, and with larger intervals with constant performance, so that the chance of selecting a good number of iterations for each classifier is increased.</Paragraph>
      <Paragraph position="5"> We introduce a new bootstrapping scheme that combines co-training or self-training with majority voting. During the bootstrapping process, the classifier at each iteration is replaced with a majority voting scheme applied to all classifiers constructed at previous iterations. This change has the effect of &amp;quot;smoothing&amp;quot; the learning curves: it slows down the learning rate, but also yields a larger interval with constant high performance3.</Paragraph>
      <Paragraph position="6"> 3Notice that in smoothed co-training, majority voting is applied on classifiers consisting of iterations of the co-training process itself, and therefore voting is applied on bootstrapped clas- null To some extent, smoothed co-training is related to boosting, since both algorithms rely on a growing ensemble of classifiers trained on resamples of the data. However, boosting assumes labeled data and is error-driven, whereas smoothed co-training combines both labeled and unlabeled data and is confidence-driven4.</Paragraph>
      <Paragraph position="7"> Figure 2 shows the learning curves for simple cotraining, and co-training &amp;quot;smoothed&amp;quot; with majority voting, for the word authority (for a growth size of 1 and pool size of 1). Notice that the trend for the smoothed curve is still the same - a raise, followed by a decline - but at a significantly lower pace. With smoothed co-training, any number of iterations selected in the interval 5-40 still leads to significant improvement over the baseline, unlike the simple unsmoothed curve, where only iterations in the range 3-10 bring improvement over the baseline (followed by two other iterations at random intervals).</Paragraph>
      <Paragraph position="8"> The methods for global parameter settings and per-word parameter settings are evaluated again, this time using smoothed co-training or self-training. Table 4 lists the results obtained with basic and smoothed co-training for the same global/per-word setting. Since the majority voting scheme requires an odd number of classifiers, sifiers across co-training iterations, with the effect of improving the performance of basic co-training. This is fundamentally different from the approach proposed in (Ng and Cardie, 2003), where they also apply majority voting in a bootstrapping framework, but in a different setting. They use a majority voting scheme applied to classifiers build on subsets of the labeled data (bagging) to induce several views for the co-training process. In their approach, majority voting is used at each co-training iteration to enable co-training by predicting labels on unlabeled data.  per-word parameter settings (same settings as listed in Table 3) the number of iterations is rounded up to the next even number (the first iteration is iteration 0, representing the basic classifier, which is also considered during voting).</Paragraph>
      <Paragraph position="9"> The same type of experiments were also performed for self-training, but the majority voting scheme did not bring any significant improvements. We believe that the learning curves for self-training are less steep, and therefore majority voting applied to classifiers across various iterations does not have the same strong smoothing effect as with co-training.</Paragraph>
      <Paragraph position="10">  For parameter selection using global settings (Table 3) co-training improves over the basic classifiers, and outperforms self-training. As previously noticed (Nigam and Ghani, 2000), it is hard to identify conditionally independent views for real-data problems. Even though we use a &amp;quot;local versus topical&amp;quot; feature split, which divides the features into two separate views on sense classification, there might be some natural dependencies between the features, since they are extracted from the same context, which may weaken the independence condition, and may sometime make the behavior of co-training similar to a self-training process. However, as theoretically shown in (Abney, 2002), and then empirically in (Clark et al., 2003), co-training still works under a weaker independence assumption, and the results we obtain concur with these previous observations.</Paragraph>
      <Paragraph position="11"> Despite the fact that parameters observed for optimal settings (Table 2) are different for each classifier, in empirical settings, one unique set of parameters for all classifiers seems to perform better than an individual set of parameters customized to each word. The bootstrapping scheme is improved even more when coupled with majority voting across various iterations. Overall, the highest error reduction is achieved with smoothed co-training using global parameter settings, where an average error reduction of 9.8% is observed with respect to the basic classifier.</Paragraph>
      <Paragraph position="12"> A comparative analysis of words that benefit from basic/smoothed co-training with global parameter settings, versus words with little or no improvement obtained through bootstrapping reveals several observations: (1) Words with accurate basic classifiers cannot be improved through co-training, which agrees with previous observations (Pierce and Cardie, 2001). For instance, no improvement was obtained for chair, holiday, or spade, which have the basic classifier performing above 75%.</Paragraph>
      <Paragraph position="13"> (2) Words with high number of senses (e.g. bar - 10 senses, channel - 7 senses, grip - 11 senses) achieve minimal improvements through co-training. This is probably explained by the fact that the classifiers are misled by the large number of classes (senses), and a large number of errors is introduced since the early stages of co-training. (3) Words that have a large number of senses not belonging to well-defined topical domains show little or no benefit from a bootstrapping procedure. Using the domains attached to word senses, as introduced in (Magnini et al., 2002), we observed that words that have a large subset of their senses not belonging to a specific domain (e.g. restraint, facility) achieve little or no improvement through co-training, which is perhaps explained again by the noisy automatic annotation that introduces errors since the early iterations of co-training.</Paragraph>
      <Paragraph position="14"> Even though not all words show benefit from cotraining, smoothed co-training with global parameter settings does bring an overall error reduction of 9.8% with respect to the basic classifier, which proves that bootstrapping through co-training is a potentially useful technique for word sense disambiguation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>