<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0407"> <Title>Bootstrapping POS taggers using Unlabelled Data</Title> <Section position="5" start_page="0" end_page="1" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The co-training framework uses labelled examples from one tagger as additional training data for the other. For the purposes of this paper, a labelled example is a tagged sentence. We chose complete sentences, rather than smaller units, because this simplifies the experiments and the publicly available version of TNT requires complete tagged sentences for training. It is possible that co-training with sub-sentential units might be more effective, but we leave this as future work.</Paragraph> <Paragraph position="1"> The co-training process is given in Figure 1. At each stage in the process there is a cache of unlabelled sentences (selected from the total pool of unlabelled sentences) which is labelled by each tagger.</Paragraph> <Paragraph position="2"> The cache size could be increased at each iteration, which is a common practice in the co-training literature. A subset of those sentences labelled by TNT is then added to the training data for C&C, and vice versa.</Paragraph> <Paragraph position="3"> Blum and Mitchell (1998) use the combined set of newly labelled examples for training each view, but we follow Goldman and Zhou (2000) in using separate labelled sets. In the remainder of this section we consider two possible methods for selecting a subset. The cache is cleared after each iteration.</Paragraph> <Paragraph position="4"> There are various ways to select the labelled examples for each tagger. A typical approach is to select those examples assigned a high score by the relevant classifier, under the assumption that these examples will be the most reliable. A score-based selection method is difficult to apply in our experiments, however, since TNT does not provide scores for tagged sentences.</Paragraph> <Paragraph position="5"> We therefore tried two alternative selection methods.</Paragraph> <Paragraph position="6"> The first is to simply add all of the cache labelled by one tagger to the training data of the other. We refer to this method as naive co-training. The second, more sophisticated, method is to select that subset of the labelled cache which maximises the agreement of the two taggers on unlabelled data. We call this method agreement-based co-training. For a large cache the number of possible subsets makes exhaustive search intractable, and so we randomly sample the subsets.</Paragraph> <Paragraph position="7"> The pseudo-code for the agreement-based selection method is given in Figure 2. The current tagger is the one being retrained, while the other tagger is kept static. The co-training process uses the selection method for selecting sentences from the cache (which has been labelled by one of the taggers). Note that during the selection process, we repeatedly sample from all possible subsets of the cache; this is done by first randomly choosing the size of the subset and then randomly choosing sentences based on the size. The number of subsets we consider is determined by the number of times the loop is traversed in Figure 2.</Paragraph> <Paragraph position="8"> If TNT is being trained on the output of C&C, then the most recent version of C&C is used to measure agreement (and vice versa); so we first attempt to improve one tagger, then the other, rather than both at the same time. The agreement rate of the taggers on unlabelled sentences is the per-token agreement rate; that is, the proportion of words in the unlabelled set of sentences that are assigned the same tag by both taggers.</Paragraph>
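To make the selection procedure concrete, the following Python fragment sketches the per-token agreement rate and the random-subset search described above. It is an illustrative reconstruction rather than the authors' code: the tagger objects, their tag(sentence) method (assumed to return a list of (word, tag) pairs), and the retrained_with(subset) helper (assumed to return a copy of the current tagger retrained with the candidate subset added to its training data) are all assumptions made for the purposes of the sketch.

```python
import random

def agreement_rate(tagger_a, tagger_b, unlabelled_sentences):
    """Per-token agreement: fraction of words assigned the same tag by both taggers."""
    agree = total = 0
    for sentence in unlabelled_sentences:
        # Both taggers tag the same token sequence, so comparing
        # (word, tag) pairs is equivalent to comparing tags.
        pairs_a = tagger_a.tag(sentence)
        pairs_b = tagger_b.tag(sentence)
        agree += sum(1 for a, b in zip(pairs_a, pairs_b) if a == b)
        total += len(sentence)
    return agree / total if total else 0.0

def select_subset(current_tagger, other_tagger, labelled_cache,
                  agreement_sentences, n_samples=100):
    """Sample random subsets of the labelled cache and keep the one that,
    when added to the current tagger's training data, maximises agreement
    with the (static) other tagger on held-out unlabelled sentences."""
    # Baseline: agreement of the current tagger before adding anything.
    best_agreement = agreement_rate(current_tagger, other_tagger, agreement_sentences)
    best_subset = []
    for _ in range(n_samples):
        size = random.randint(1, len(labelled_cache))     # random subset size
        subset = random.sample(labelled_cache, size)      # random members of that size
        candidate = current_tagger.retrained_with(subset) # assumed helper: retrain a copy
        agreement = agreement_rate(candidate, other_tagger, agreement_sentences)
        if agreement > best_agreement:
            best_subset, best_agreement = subset, agreement
    return best_subset  # empty if no sampled subset improves agreement (sample rejected)
```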
<Paragraph position="9"> For the small seed set experiments, the seed data was an arbitrarily chosen subset of sections 10-19 of the WSJ Penn Treebank; the unlabelled training data was taken from 50,000 sentences of the 1994 WSJ section of the North American News Corpus (NANC); and the unlabelled data used to measure agreement was around 10,000 sentences from sections 1-5 of the Treebank.</Paragraph> <Paragraph position="10"> Section 00 of the Treebank was used to measure the accuracy of the taggers. The cache size was 500 sentences.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Self-Training and Agreement-based Co-training Results </SectionTitle> <Paragraph position="0"> Figure 3 shows the results for self-training, in which each tagger is simply retrained on its own labelled cache at each round. (By round we mean the re-training of a single tagger, so there are two rounds per co-training iteration.) TNT does improve using self-training, from 81.4% to 82.2%, but C&C is unaffected. Re-running these experiments using a range of unlabelled training sets gave similar results. For agreement-based co-training, we used a cache size of 500 and searched through 100 subsets of the labelled cache to find the one that maximises agreement. Co-training improves the performance of both taggers: TNT improves from 81.4% to 84.9%, and C&C improves from 73.2% to 84.3% (an error reduction of over 40%).</Paragraph> <Paragraph position="1"> Figures 5 and 6 show the self-training results and agreement-based co-training results when a larger seed set, of 500 sentences, is used for each tagger. In this case, self-training harms TNT, while C&C is again unaffected. Co-training continues to be beneficial.</Paragraph> <Paragraph position="2"> Towards the end of the co-training run, more material is being selected for C&C than TNT. The experiments using a seed set size of 50 showed a similar trend, but the difference between the two taggers was less marked. By examining the subsets chosen from the labelled cache at each round, we also observed that a large proportion of the cache was being selected for both taggers.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Naive Co-training Results </SectionTitle> <Paragraph position="0"> Agreement-based co-training for POS taggers is effective but computationally demanding. The previous two agreement maximisation experiments involved retraining each tagger 2,500 times. Given this, and the observation that maximisation generally has a preference for selecting a large proportion of the labelled cache, we looked at naive co-training: simply retraining upon all available material (i.e. the whole cache) at each round.</Paragraph> <Paragraph position="1"> Table 2 shows the naive co-training results after 50 rounds of co-training when varying the size of the cache. 50 manually labelled sentences were used as the seed material. Table 3 shows results for the same experiment, but this time with a seed set of 500 manually labelled sentences.</Paragraph> <Paragraph position="2"> We see that naive co-training improves as the cache size increases. For a large cache, the performance levels for naive co-training are very similar to those produced by our agreement-based co-training method. After 50 rounds of co-training using 50 seed sentences, the agreement rates for naive and agreement-based co-training were very similar, rising from an initial value of 73% to 97%.</Paragraph>
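Naive co-training itself reduces to a simple alternating loop: at each iteration one tagger labels a fresh cache, the whole labelled cache is added to the other tagger's training data, that tagger is retrained, and the roles are swapped. The sketch below is a hypothetical illustration under an assumed tagger interface (train(tagged_sentences) and tag(sentence), where tag returns a tagged sentence in the same format train expects); it is not the authors' implementation.

```python
def naive_co_train(tagger_a, tagger_b, seed_a, seed_b,
                   unlabelled_pool, cache_size=500, iterations=25):
    """Naive co-training: each iteration draws a cache of unlabelled sentences,
    each tagger labels it, and the whole cache labelled by one tagger is added
    to the other tagger's training data (two retraining rounds per iteration).
    The cache is discarded after each iteration."""
    train_a, train_b = list(seed_a), list(seed_b)
    tagger_a.train(train_a)
    tagger_b.train(train_b)
    for _ in range(iterations):
        # Draw a fresh cache from the pool of unlabelled sentences.
        cache = [unlabelled_pool.pop()
                 for _ in range(min(cache_size, len(unlabelled_pool)))]
        if not cache:
            break  # unlabelled data exhausted
        # Round 1: tagger A labels the cache; tagger B is retrained on all of it.
        train_b.extend(tagger_a.tag(sentence) for sentence in cache)
        tagger_b.train(train_b)
        # Round 2: the just-retrained tagger B labels the cache for tagger A.
        train_a.extend(tagger_b.tag(sentence) for sentence in cache)
        tagger_a.train(train_a)
    return tagger_a, tagger_b
```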
<Paragraph position="3"> Naive co-training is more efficient than agreement-based co-training. For the parameter settings used in the previous experiments, agreement-based co-training required the taggers to be re-trained 10 to 100 times more often than naive co-training. There are advantages to agreement-based co-training, however. First, the agreement-based method dynamically selects the best sample at each stage, which may not be the whole cache. In particular, when the agreement rate cannot be improved upon, the selected sample can be rejected. For naive co-training, new samples will always be added, and so there is a possibility that the noise accumulated at later stages will start to degrade performance (see Pierce and Cardie (2001)). Second, for naive co-training, the optimal amount of data to be added at each round (i.e. the cache size) is a parameter that needs to be determined on held out data, whereas the agreement-based method determines this automatically.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Larger-Scale Experiments </SectionTitle> <Paragraph position="0"> We also performed a number of experiments using much more unlabelled training material than before. Instead of using 50,000 sentences from the 1994 WSJ section of the North American News Corpus, we used 417,000 sentences (from the same section) and ran the experiments until the unlabelled data had been exhausted.</Paragraph> <Paragraph position="1"> One experiment used naive co-training, with 50 seed sentences and a cache of size 500. This led to an agreement rate of 99%, with performance levels of 85.4% and 85.4% for TNT and C&C respectively. 230,000 sentences (approximately 5 million words) had been processed and were used as training material by the taggers. The other experiment used our agreement-based co-training approach (50 seed sentences, cache size of 1,000 sentences, exploring at most 10 subsets in the maximisation process per round). The agreement rate was 98%, with performance levels of 86.0% and 85.9% for TNT and C&C respectively. 124,000 sentences had been processed, of which 30,000 labelled sentences were selected for training TNT and 44,000 labelled sentences were selected for training C&C.</Paragraph> <Paragraph position="2"> Co-training using this much larger amount of unlabelled material did improve our previously mentioned results, but not by a large margin.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Co-training using Imbalanced Views </SectionTitle> <Paragraph position="0"> It is interesting to consider what happens when one view is initially much more accurate than the other view. We trained one of the taggers on much more labelled seed data than the other, to see how this affects the co-training process. One tagger was initialised with 500 seed sentences and the other with 50, and agreement-based co-training was applied, using a cache size of 500 sentences. The results are shown in Table 4.</Paragraph> <Paragraph position="1"> Co-training continues to be effective, even when the two taggers are imbalanced. Also, the final performance of the taggers is around the same value, irrespective of the direction of the imbalance.</Paragraph>
</Section> <Section position="5" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 4.5 Large Seed Experiments </SectionTitle> <Paragraph position="0"> Although bootstrapping from unlabelled data is particularly valuable when only small amounts of training material are available, it is also interesting to see if self-training or co-training can improve state of the art POS taggers.</Paragraph> <Paragraph position="1"> For these experiments, both C&C and TNT were initially trained on sections 00-18 of the WSJ Penn Treebank, and sections 19-21 and 22-24 were used as the development and test sets. The 1994-1996 WSJ text from the NANC was used as unlabelled material to fill the cache.</Paragraph> <Paragraph position="2"> The cache size started out at 8,000 sentences and increased by 10% in each round to match the increasing labelled training data. In each round of self-training or naive co-training 10% of the cache was randomly selected and added to the labelled training data. The experiments ran for 40 rounds.</Paragraph> <Paragraph position="3"> The performance of the different training regimes is listed in Table 5. These results show no significant improvement using either self-training or co-training with very large seed datasets. Self-training shows only a slight improvement. We have shown that co-training is an effective technique for bootstrapping POS taggers trained on small amounts of labelled data. Using unlabelled data, we are able to improve TNT from 81.3% to 86.0%, whilst C&C shows a much more dramatic improvement of 73.2% to 85.9%.</Paragraph> <Paragraph position="4"> Our agreement-based co-training results support the theoretical arguments of Abney (2002) and Dasgupta et al. (2002), that directly maximising the agreement rates between the two taggers reduces generalisation error. Examination of the selected subsets showed a preference for a large proportion of the cache. This led us to propose a naive co-training approach, which significantly reduced the computational cost without a significant performance penalty.</Paragraph> <Paragraph position="5"> We also showed that naive co-training was unable to improve the performance of the taggers when they had already been trained on large amounts of manually annotated data. It is possible that agreement-based co-training, using more careful selection, would result in an improvement. We leave these experiments to future work, but note that there is a large computational cost associated with such experiments.</Paragraph> <Paragraph position="6"> The performance of the bootstrapped taggers is still a long way behind a tagger trained on a large amount of manually annotated data. This finding is in accord with earlier work on bootstrapping taggers using EM (Elworthy, 1994; Merialdo, 1994). An interesting question would be to determine the minimum number of manually labelled examples that need to be used to seed the system before we can achieve comparable results to using all available manually labelled sentences.</Paragraph> <Paragraph position="7"> For our experiments, co-training never led to a decrease in performance, regardless of the number of iterations. 
The opposite behaviour has been observed in other applications of co-training (Pierce and Cardie, 2001).</Paragraph> <Paragraph position="8"> Whether this robustness is a property of the tagging problem or our approach is left for future work.</Paragraph> <Paragraph position="9"> This is probably due to chance selection of better subsets.</Paragraph> </Section> </Section> </Paper>