<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1005">
<Title>Scaling to Very Very Large Corpora for Natural Language Disambiguation</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Confusion Set Disambiguation </SectionTitle>
<Paragraph position="0"> Confusion set disambiguation is the problem of choosing the correct use of a word, given a set of words with which it is commonly confused.</Paragraph>
<Paragraph position="1"> Example confusion sets include {principle, principal}, {then, than}, {to, two, too}, and {weather, whether}.</Paragraph>
<Paragraph position="2"> Numerous methods have been presented for confusable disambiguation. The more recent set of techniques includes multiplicative weight-update algorithms (Golding and Roth, 1998), latent semantic analysis (Jones and Martin, 1997), transformation-based learning (Mangu and Brill, 1997), differential grammars (Powers, 1997), decision lists (Yarowsky, 1994), and a variety of Bayesian classifiers (Gale et al., 1993; Golding, 1995; Golding and Schabes, 1996). In all of these approaches, the problem is formulated as follows: given a specific confusion set (e.g. {to, two, too}), all occurrences of confusion set members in the test set are replaced by a marker; everywhere the system sees this marker, it must decide which member of the confusion set to choose.</Paragraph>
<Paragraph position="3"> Confusion set disambiguation is one of a class of natural language problems involving disambiguation from a relatively small set of alternatives based upon the string context in which the ambiguity site appears. Other such problems include word sense disambiguation, part-of-speech tagging and some formulations of phrasal chunking. One advantageous aspect of confusion set disambiguation, which allows us to study the effects of large data sets on performance, is that labeled training data is essentially free, since the correct answer is surface apparent in any collection of reasonably well-edited text.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Learning Curve Experiments </SectionTitle>
<Paragraph position="0"> This work was partially motivated by the desire to develop an improved grammar checker.</Paragraph>
<Paragraph position="1"> Given a fixed amount of time, we considered what would be the most effective way to focus our efforts in order to attain the greatest performance improvement. Some possibilities included modifying standard learning algorithms, exploring new learning techniques, and using more sophisticated features. Before exploring these somewhat expensive paths, we decided to first see what happened if we simply trained an existing method with much more data. This led to the exploration of learning curves for various machine learning algorithms: winnow, perceptron, naive Bayes, and a very simple memory-based learner. (Thanks to Dan Roth for making both Winnow and Perceptron available.) For the first three learners, we used the standard collection of features employed for this problem: the set of words within a window of the target word, and collocations containing words and/or parts of speech.</Paragraph>
<Paragraph position="2"> The memory-based learner used only the word before and word after as features.</Paragraph>
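To make the task setup concrete, the sketch below derives labeled examples and context features of the kind described above directly from raw text. The window size, collocation template, and function names are illustrative assumptions, not the paper's exact configuration (and part-of-speech collocations are omitted for brevity).

```python
# Sketch: deriving "free" labeled examples and context features for a
# confusion set directly from raw tokenized text. Window size and the
# collocation template are illustrative choices, not the paper's.

CONFUSION_SET = {"to", "two", "too"}
WINDOW = 3  # words of context on each side (assumed value)

def examples(tokens, confusion_set=CONFUSION_SET, window=WINDOW):
    """Yield (feature_set, label) pairs for every confusion-set occurrence."""
    for i, tok in enumerate(tokens):
        if tok.lower() not in confusion_set:
            continue
        label = tok.lower()                         # the "free" gold label
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        feats = {f"win:{w.lower()}" for w in left + right}   # bag of window words
        feats.add("coll:%s_%s" % (left[-1].lower() if left else "<s>",
                                  right[0].lower() if right else "</s>"))  # adjacent-word collocation
        yield feats, label

if __name__ == "__main__":
    sent = "I want to go too".split()
    for feats, label in examples(sent):
        print(label, sorted(feats))
```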
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> Disambiguation </SectionTitle>
<Paragraph position="0"> We collected a 1-billion-word training corpus from a variety of English texts, including news articles, scientific abstracts, government transcripts, literature and other varied forms of prose. This training corpus is three orders of magnitude greater than the largest training corpus previously used for this problem. We used 1 million words of Wall Street Journal text as our test set, and no data from the Wall Street Journal was used when constructing the training corpus. Each learner was trained at several cutoff points in the training corpus, i.e., the first one million words, the first five million words, and so on, until all one billion words were used for training. In order to avoid training biases that may result from merely concatenating the different data sources to form a larger training corpus, we constructed each consecutive training corpus by probabilistically sampling sentences from the different sources, weighted by the size of each source.</Paragraph>
<Paragraph position="1"> In Figure 1, we show learning curves for each learner, up to one billion words of training data. Each point in the graph is the average performance over ten confusion sets for that size training corpus. Note that the curves appear to be log-linear even out to one billion words.</Paragraph>
<Paragraph position="2"> Of course, for many problems, additional training data has a non-zero cost. However, these results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development. At least for the problem of confusable disambiguation, none of the learners tested is close to asymptoting in performance at the training corpus size commonly employed by the field.</Paragraph>
<Paragraph position="3"> Such gains in accuracy, however, do not come for free. Figure 2 shows the size of learned representations as a function of training data size. For some applications, this is not necessarily a concern. But for others, where space comes at a premium, obtaining the gains that come with a billion words of training data may not be viable without an effort to compress information. In such cases, one could look at numerous methods for compressing data (e.g. Dagan and Engelson, 1995; Weng et al., 1998).</Paragraph>
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 The Efficacy of Voting </SectionTitle>
<Paragraph position="0"> Voting has proven to be an effective technique for improving classifier accuracy in many applications, including part-of-speech tagging (van Halteren et al., 1998), parsing (Henderson and Brill, 1999), and word sense disambiguation (Pedersen, 2000). By training a set of classifiers on a single training corpus and then combining their outputs in classification, it is often possible to achieve a target accuracy with less labeled training data than would be needed if only one classifier were being used. Voting can be effective in reducing both the bias of a particular training corpus and the bias of a specific learner.</Paragraph>
<Paragraph position="1"> When a training corpus is very small, there is much more room for these biases to surface and therefore for voting to be effective.</Paragraph>
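The experiments that follow combine learners by summing the normalized scores each assigns to a candidate label. The sketch below shows one way such a combiner might look; the per-classifier score dictionaries stand in for whatever interface the actual learners expose, which is an assumption on our part.

```python
# Sketch: combining classifiers by summing normalized per-label scores.
# Each classifier is assumed to return a {label: score} dict; the exact
# learner interface is hypothetical.

def normalize(scores):
    """Scale a {label: score} dict so the scores sum to 1."""
    total = sum(scores.values())
    return {lab: s / total for lab, s in scores.items()} if total else scores

def vote(classifier_scores):
    """classifier_scores: list of {label: score} dicts, one per classifier."""
    combined = {}
    for scores in classifier_scores:
        for lab, s in normalize(scores).items():
            combined[lab] = combined.get(lab, 0.0) + s
    return max(combined, key=combined.get)

# Example: three learners scoring the {then, than} decision.
print(vote([{"then": 3.0, "than": 1.0},
            {"then": 0.2, "than": 0.6},
            {"then": 0.7, "than": 0.3}]))   # -> "then"
```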
<Paragraph position="2"> But does voting still offer performance gains when classifiers are trained on much larger corpora? Brill and Wu (1998) defined the complementarity between two learners to quantify the percentage of time that one system is wrong while the other is correct, thereby providing an upper bound on the accuracy attainable by combining the two. As training size increases significantly, we would expect the complementarity between classifiers to decrease.</Paragraph>
<Paragraph position="3"> This is due in part to the fact that a larger training corpus will reduce the data set variance and any bias arising from it. Also, some of the differences between classifiers might be due to how they handle a sparse training set.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> Corpus Size </SectionTitle>
<Paragraph position="0"> Comparing a sample of two learners as a function of increasingly large training sets, we see in Table 1 that complementarity does indeed decrease as training size increases.</Paragraph>
<Paragraph position="1"> Next we tested whether this decrease in complementarity meant that voting loses its effectiveness as the training set grows. To examine the impact of voting when using a significantly larger training corpus, we ran three of the four learners on our set of 10 confusable pairs, excluding the memory-based learner. Voting was done by combining the normalized score each learner assigned to a classification choice. In Figure 3, we show the accuracy obtained from voting, along with the single best learner accuracy at each training set size. We see that for very small corpora, voting is beneficial, resulting in better performance than any single classifier. Beyond 1 million words, little is gained by voting, and indeed on the largest training sets voting actually hurts accuracy.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="0" end_page="0" type="metho">
<SectionTitle> 5 When Annotated Data Is Not Free </SectionTitle>
<Paragraph position="0"> While the observation that learning curves are not asymptoting even with orders of magnitude more training data than is currently used is very exciting, this result may have somewhat limited ramifications, since very few problems exist for which annotated data of this size is available for free. Surely we cannot reasonably expect that the manual annotation of one billion words along with corresponding parse trees will occur any time soon (but see Banko and Brill (2001) for a discussion of why this might not be completely infeasible). Despite this obstacle, there are techniques one can use to try to obtain the benefits of considerably larger training corpora without incurring significant additional costs. In the sections that follow, we study two such solutions: active learning and unsupervised learning.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.1 Active Learning </SectionTitle>
<Paragraph position="0"> Active learning involves intelligently selecting a portion of samples for annotation from a pool of as-yet unannotated training samples. Not all samples in a training set are equally useful. By concentrating human annotation efforts on the samples of greatest utility to the machine learning algorithm, it may be possible to attain better performance for a fixed annotation cost than if samples were chosen randomly for human annotation.</Paragraph>
<Paragraph position="1"> Most active learning approaches work by first training a seed learner (or family of learners) and then running the learner(s) over a set of unlabeled samples. A sample is presumed to be more useful for training the more uncertain its classification label is.</Paragraph>
<Paragraph position="2"> Uncertainty can be judged by the relative weights assigned to different labels by a single classifier (Lewis and Catlett, 1994). Another approach, committee-based sampling, first creates a committee of classifiers and then judges classification uncertainty according to how much the learners differ in their label assignments. For example, Dagan and Engelson (1995) describe a committee-based sampling technique in which a part-of-speech tagger is trained using an annotated seed corpus. A family of taggers is then generated by randomly permuting the tagger probabilities, and the disparity among the tags output by the committee members is used as a measure of classification uncertainty. Sentences for human annotation are drawn, biased to prefer those containing high-uncertainty instances.</Paragraph>
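As a rough illustration of the single-classifier uncertainty criterion just mentioned, the sketch below scores instances by the entropy of a classifier's label distribution; the predict-probability interface and the entropy scoring are our assumptions, not necessarily what Lewis and Catlett used.

```python
# Sketch: scoring how uncertain a single classifier is about an instance's
# label, via the entropy of its (assumed) posterior over the confusion set.
import math

def label_entropy(label_probs):
    """Entropy of a {label: probability} distribution; higher = more uncertain."""
    return -sum(p * math.log(p) for p in label_probs.values() if p > 0)

def most_uncertain(classifier, unlabeled, k):
    """Return the k unlabeled instances the classifier is least sure about."""
    return sorted(unlabeled,
                  key=lambda x: label_entropy(classifier(x)),
                  reverse=True)[:k]

# Example: a near-uniform posterior is far more "useful" than a confident one.
print(label_entropy({"then": 0.5, "than": 0.5}))    # ~0.693
print(label_entropy({"then": 0.99, "than": 0.01}))  # ~0.056
```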
<Paragraph position="3"> While active learning has been shown to work for a number of tasks, the majority of active learning experiments in natural language processing have been conducted using very small seed corpora and sets of unlabeled examples. Therefore, we wish to explore situations where we have, or can afford, a training corpus of non-negligible size (such as for part-of-speech tagging) and have access to very large amounts of unlabeled data.</Paragraph>
<Paragraph position="4"> We can use bagging (Breiman, 1996), a technique for generating a committee of classifiers, to assess the label uncertainty of a potential training instance. With bagging, a variant of the original training set is constructed by randomly sampling sentences with replacement from the source training set, producing N new training sets of size equal to the original. After the N models have been trained and run on the same test set, their classifications for each test sentence can be compared for agreement. The higher the disagreement among the classifiers on an instance, the more useful it would be to have that instance labeled. We used the naive Bayes classifier, creating 10 classifiers, each trained on bags generated from an initial one million words of labeled training data. We present the active learning algorithm we used below.</Paragraph>
<Paragraph position="5"> Initialize: the training data consists of X correctly labeled words.
Iterate:
1) Generate a committee of classifiers using bagging on the training set.
2) Run the committee on the unlabeled portion of the training set.
3) Choose M instances from the unlabeled set for labeling: pick the M/2 with the greatest vote entropy and another M/2 at random, and add them to the training set.
We initially tried selecting the M most uncertain examples, but this resulted in a sample too biased toward the difficult instances.</Paragraph>
<Paragraph position="6"> Instead we pick half of our samples for annotation randomly and the other half from those whose labels we are most uncertain of, as judged by the entropy of the votes assigned to the instance by the committee. This is, in effect, biasing our sample toward instances the classifiers are most uncertain of.</Paragraph>
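A minimal sketch of the bagging and vote-entropy selection described above follows. Committee members are assumed to be callables that return a label, fitting one classifier per bag is left to the learner, and the bag count, seeding, and helper names are illustrative rather than the authors' implementation.

```python
# Sketch of the committee / vote-entropy selection step described above.
# Only the bagging and entropy logic mirrors the text; interfaces are assumed.
import math
import random
from collections import Counter

def bag(train, n_bags=10, rng=random.Random(0)):
    """Return n_bags bootstrap replicates of the training set (one per committee member)."""
    return [[rng.choice(train) for _ in train] for _ in range(n_bags)]

def vote_entropy(labels):
    """Entropy of the committee's votes for one instance."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_for_annotation(committee, unlabeled, m, rng=random.Random(0)):
    """Pick M/2 highest vote-entropy instances plus M/2 random ones."""
    scored = sorted(unlabeled,
                    key=lambda x: vote_entropy([clf(x) for clf in committee]),
                    reverse=True)
    uncertain = scored[: m // 2]
    rest = scored[m // 2:]
    return uncertain + rng.sample(rest, min(m // 2, len(rest)))
```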
<Paragraph position="7"> We show the results from sample selection for confusion set disambiguation, compared with the test set accuracy achieved for different percentages of the one-billion-word training set when training instances are simply taken in sequence. We ran three active learning experiments, increasing the size of the total unlabeled training corpus from which we can pick samples to be annotated. In all three cases, sample selection outperforms sequential sampling. At the endpoint of each training run in the graph, the same number of samples has been annotated for training. However, we see that the larger the pool of candidate instances for annotation, the better the resulting accuracy. By increasing the pool of unlabeled training instances for active learning, we can improve accuracy with only a fixed additional annotation cost. Thus it is possible to benefit from the availability of extremely large corpora without incurring the full costs of annotation, training time, and representation size.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.2 Weakly Supervised Learning </SectionTitle>
<Paragraph position="0"> While the previous section shows that we can benefit from substantially larger training corpora without needing significant additional manual annotation, it would be ideal if we could improve classification accuracy using only our seed annotated corpus and the large unlabeled corpus, without requiring any additional hand labeling. In this section we turn to unsupervised learning in an attempt to achieve this goal.</Paragraph>
<Paragraph position="1"> Numerous approaches have been explored for exploiting situations where some amount of annotated data is available and a much larger amount of data exists unannotated, e.g. Merialdo's HMM part-of-speech tagger training (1994), Charniak's parser retraining experiment (1996), Yarowsky's seeds for word sense disambiguation (1995), and Nigam et al.'s (1998) topic classifier learned in part from unlabeled documents. A nice discussion of this general problem can be found in Mitchell (1999).</Paragraph>
<Paragraph position="2"> The question we want to answer is whether there is something to be gained by combining unsupervised and supervised learning when we scale up both the seed corpus and the unlabeled corpus significantly. We can again use a committee of bagged classifiers, this time for unsupervised learning. Whereas with active learning we want to choose the most uncertain instances for human annotation, with unsupervised learning we want to choose the instances whose automatically assigned labels have the highest probability of being correct, and add them to our labeled training data.</Paragraph>
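The sketch below illustrates the agreement-based harvesting step described here and in the following paragraphs; the committee interface and the retraining helper are placeholders we introduce for illustration, not the authors' code.

```python
# Sketch of the self-training step: label instances automatically when the
# bagged committee is unanimous, then fold them into the training data.
# Committee members are assumed to be callables returning a label; the
# fit_committee function is a placeholder for whatever learner is used.

def harvest_unanimous(committee, unlabeled):
    """Return (instance, label) pairs on which every committee member agrees."""
    harvested = []
    for x in unlabeled:
        labels = {clf(x) for clf in committee}
        if len(labels) == 1:                  # e.g., all 10 bags agree
            harvested.append((x, labels.pop()))
    return harvested

def self_train(train_labeled, unlabeled, fit_committee):
    """One round: train on seed data, add unanimously labeled instances, retrain."""
    committee = fit_committee(train_labeled)
    new_examples = harvest_unanimous(committee, unlabeled)
    return fit_committee(train_labeled + new_examples)
```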
<Paragraph position="3"> In Table 2, we show the test set accuracy (averaged over the four most frequently occurring confusion pairs) as a function of the number of classifiers that agree upon the label of an instance. For this experiment, we trained a collection of 10 naive Bayes classifiers, using bagging on a 1-million-word seed corpus. As can be seen, the greater the classifier agreement, the more likely it is that a test sample has been correctly labeled.</Paragraph>
<Paragraph position="4"> Since the instances on which all bags agree have the highest probability of being correct, we attempted to automatically grow our labeled training set using the 1-million-word labeled seed corpus along with the collection of naive Bayes classifiers described above. All instances from the remainder of the corpus on which all 10 classifiers agreed were selected, trusting the agreed-upon label. The classifiers were then retrained using the labeled seed corpus plus the new training material collected automatically during the previous step.</Paragraph>
<Paragraph position="5"> In Table 3 we show the results from these unsupervised learning experiments for two confusion sets, {then, than} and {among, between}. In both cases we gain from unsupervised training compared to using only the seed corpus, but only up to a point. Beyond that point, test set accuracy begins to decline as additional training instances are automatically harvested. We are able to attain improvements in accuracy for free using unsupervised learning, but unlike our learning curve experiments with correctly labeled data, accuracy does not continue to improve with additional data.</Paragraph>
<Paragraph position="6"> Charniak (1996) ran an experiment in which he trained a parser on one million words of parsed data, ran the parser over an additional 30 million words, and used the resulting parses to reestimate model probabilities. Doing so gave a small improvement over using only the manually parsed data. We repeated this experiment with our data and show the outcome in Table 4. Choosing only the labeled instances most likely to be correct, as judged by a committee of classifiers, results in higher accuracy than using all instances classified by a model trained with the labeled seed corpus.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> Methods </SectionTitle>
<Paragraph position="0"> In applying unsupervised learning to improve upon a seed-trained method, we consistently saw an improvement in performance followed by a decline. This is likely because we eventually reach a point where the gains from additional training data are offset by the sample bias introduced in mining these instances. It may be possible to combine active learning with unsupervised learning as a way to reduce this sample bias and gain the benefits of both approaches.</Paragraph>
</Section>
</Section>
</Paper>