<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2405"> <Title>Co-training and Self-training for Word Sense Disambiguation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Co-training and Self-training for Natural </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Language Learning </SectionTitle> <Paragraph position="0"> Co-training and self-training are bootstrapping methods that aim to improve the performance of a supervised learning algorithm by incorporating large amounts of unlabeled data into the training data set.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Co-training </SectionTitle> <Paragraph position="0"> Starting with a set of labeled data, co-training algorithms (Blum and Mitchell, 1998) attempt to increase the amount of annotated data using some (large) amounts of unlabeled data. Shortly, co-training algorithms work by generating several classifiers trained on the input labeled data, which are then used to tag new unlabeled data. From this newly annotated data, the most confident predictions are sought, and subsequently added to the set of labeled data. The process may continue for several iterations.</Paragraph> <Paragraph position="1"> In natural language learning, co-training was applied to statistical parsing (Sarkar, 2001), reference resolution (Ng and Cardie, 2003), part of speech tagging (Clark et al., 2003), and others, and was generally found to bring improvement over the case when no additional unlabeled data are used.</Paragraph> <Paragraph position="2"> One important aspect of co-training consists in the relation between the views used in learning. In the original definition of co-training, (Blum and Mitchell, 1998) state conditional independence of the views as a required criterion for co-training to work. In recent work, (Abney, 2002) shows that the independence assumption can be relaxed, and co-training is still effective under a weaker independence assumption. He is proposing a greedy algorithm to maximize agreement on unlabelled data, which produces good results in a co-training experiment for named entity classification. Moreover, (Clark et al., 2003) show that a naive co-training process that does not explicitly seek to maximize agreement on unlabelled data can lead to similar performance, at a much lower computational cost. In this work, we apply co-training by identifying two different feature sets based on a &quot;local versus topical&quot; feature split, which represent potentially independent views for word sense classification, as shown in Section 4.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Self-training </SectionTitle> <Paragraph position="0"> While there is a common agreement on the definition of co-training, the literature provides several sometimes conflicting definitions for self-training. (Ng and Cardie, 2003) define self-training as a &quot;single-view weakly supervised algorithm&quot;, build by training a committee of classifiers using bagging, combined with majority voting for final label selection. 
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Self-training </SectionTitle>
<Paragraph position="0"> While there is common agreement on the definition of co-training, the literature provides several, sometimes conflicting, definitions for self-training. (Ng and Cardie, 2003) define self-training as a &quot;single-view weakly supervised algorithm&quot;, built by training a committee of classifiers using bagging, combined with majority voting for final label selection. (Clark et al., 2003) provide a different definition: self-training is performed using &quot;a tagger that is retrained on its own labeled cache on each round&quot;.</Paragraph>
<Paragraph position="1"> We adopt this second definition, which also agrees with the definition given in (Nigam and Ghani, 2000).</Paragraph>
<Paragraph position="2"> Self-training starts with a set of labeled data and builds a classifier, which is then applied to the set of unlabeled data. Only those instances whose labeling confidence exceeds a certain threshold are added to the labeled set.</Paragraph>
<Paragraph position="3"> The classifier is then retrained on the new set of labeled examples, and the process continues for several iterations.</Paragraph>
<Paragraph position="4"> Notice that only one classifier is required, with no split of features.</Paragraph>
<Paragraph position="5"> Figure 1 illustrates the general bootstrapping process.</Paragraph>
<Paragraph position="6"> Starting with a set of labeled and unlabeled data, the bootstrapping algorithm aims to improve the classification performance by integrating examples from the unlabeled data into the labeled data set. At each iteration, the class distribution in the labeled data is maintained by keeping a constant ratio across classes between already labeled examples and newly added examples; the role of this step is to avoid introducing imbalance into the training data set. For co-training, the algorithm requires two different views (two different classifiers C1 and C2) that interact in the bootstrapping process. By limiting the number of views to one (one general classifier C1), co-training is transformed into a self-training process, where a single classifier learns from its own output.</Paragraph>
<Paragraph position="7">
Figure 1: The general bootstrapping algorithm.
0. Given:
   - a set L of labeled training examples
   - a set U of unlabeled examples
   - classifiers Ci
1. Create a pool U' of examples by choosing P random examples from U.
2. Loop for I iterations:
   2.1 Use L to individually train the classifiers Ci, and label the examples in U'.
   2.2 For each classifier Ci, select the G most confidently labeled examples and add them to L, while maintaining the class distribution in L.
   2.3 Refill U' with examples from U, to keep U' at a constant size of P examples.
Co-training and self-training parameters. Three different parameters can be set in the bootstrapping process, and the performance achieved through bootstrapping usually depends on the values chosen for these parameters: the pool size P, the number of iterations I, and the number G of most confidently labeled examples that are added at each iteration to the set of labeled data L.</Paragraph>
<Paragraph position="8"> As previously noted (Ng and Cardie, 2003), there is no principled method for selecting optimal values for these parameters, which is an important disadvantage of these algorithms. In the following, we show that there is a large gap between the performance achieved for optimal parameter settings, selected through measurements performed on the test data, and the performance obtained when these parameters are set empirically, suggesting that more research is required to narrow this gap and make these bootstrapping algorithms useful for practical applications.</Paragraph>
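The bootstrapping loop of Figure 1, instantiated for self-training (a single classifier, no feature split), can be sketched as follows. The scikit-learn-style classifier interface and all names are again illustrative assumptions; the parameters mirror the pool size P, the number of iterations I, and the growth size G, and the class-distribution balancing step is omitted for brevity.

    # Minimal single-classifier self-training sketch of the bootstrapping loop;
    # interface and names are illustrative, not the paper's implementation.
    import numpy as np

    def self_train(X, y, X_u, clf, P=500, I=20, G=10, seed=0):
        rng = np.random.default_rng(seed)
        unlab = np.arange(len(X_u))        # indices of still-unlabeled examples in U
        pool = np.array([], dtype=int)     # the pool U'
        for _ in range(I):
            # Steps 1 / 2.3: (re)fill U' with random examples from U, up to size P.
            candidates = np.setdiff1d(unlab, pool)
            need = min(P - len(pool), len(candidates))
            if need > 0:
                pool = np.concatenate([pool, rng.choice(candidates, size=need, replace=False)])
            if len(pool) == 0:
                break
            # Step 2.1: train on L and label the examples in U'.
            clf.fit(X, y)
            proba = clf.predict_proba(X_u[pool])
            # Step 2.2: add the G most confidently labeled examples to L
            # (class-distribution balancing is omitted in this sketch).
            order = np.argsort(proba.max(axis=1))[-G:]
            top = pool[order]
            X = np.vstack([X, X_u[top]])
            y = np.concatenate([y, clf.predict(X_u[top])])
            unlab = np.setdiff1d(unlab, top)
            pool = np.setdiff1d(pool, top)
        return clf

Co-training corresponds to running the same loop with two view-specific classifiers that label examples for each other, as in the earlier sketch.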
<Paragraph position="9"> First, we describe the general framework of supervised word sense disambiguation, and introduce several basic sense classifiers that are used in the co-training and self-training experiments. Next, through several experiments, (1) we determine the optimal parameter settings for co-training and self-training, and (2) we explore various algorithms for the empirical selection of these three parameters for best performance.</Paragraph>
</Section>
</Section>
</Paper>