<?xml version="1.0" standalone="yes"?> <Paper uid="J98-1005"> <Title>Disambiguating Highly Ambiguous Words</Title> <Section position="3" start_page="126" end_page="128" type="metho"> <SectionTitle> 3. Extracting Contextual Representations </SectionTitle> <Paragraph position="0"> Capturing syntagmatic relations is equivalent to creating contextual representations for the words within the lexicon. Miller and Charles (1991) define a contextual representation as a characterization of the linguistic contexts in which a word appears. In earlier work, we demonstrated that contextual representations consisting of both local and topical components are effective for resolving word senses and can be automatically extracted from sample texts (Leacock, Towell, and Voorhees 1996). The topical component consists of substantive words that are likely to co-occur with a given sense of the target word. Word order and grammatical inflections are not used in topical context. In contrast, the local component includes information on word order, distance, and some information about syntactic structure; it includes all tokens (words and punctuation marks) in the immediate vicinity of the target word. Inclusion of a local component is motivated in part by a study that showed that Princeton University undergraduates were more accurate at resolving word senses when given complete sentences than when given only an alphabetized list of content words appearing in the sentences (Leacock, Towell, and Voorhees 1996).</Paragraph> <Paragraph position="1"> 1 Paradigmatic relations refer to the generalization/specialization relations that give instances or examples of related words: e.g., plant, flower, tulip. In contrast, syntagmatic relations define words that frequently co-occur or are used in similar contexts: e.g., flower, garden, hoe.</Paragraph> <Paragraph position="2"> In this paper, we continue to explore contextual representations by using neural networks to extract both topical and local contexts and combining the results of the two networks into a single word sense classifier. While Véronis and Ide (1990) also use large neural networks to resolve word senses, their approach is quite different from ours.</Paragraph> <Paragraph position="3"> Véronis and Ide use a spreading activation algorithm on a network whose structure is automatically extracted from dictionary definitions. In contrast, we use feed-forward networks that learn salient features of context from a set of tagged training examples.</Paragraph> <Paragraph position="4"> Many researchers have used learning algorithms to derive a disambiguation method from a training corpus. For example, Hearst (1991) uses orthographic, syntactic, and lexical features of the target and local context to train on. Yarowsky (1993) and Leacock, Towell, and Voorhees (1996) also found that local context is a highly reliable indicator of sense. However, their results uniformly confirm that all too often there is not enough local information available for the classifiers to make a decision.</Paragraph> <Paragraph position="5"> Gale, Church, and Yarowsky (1992) developed a topical classifier based on Bayesian decision theory. The classifier trains on all and only alphanumeric characters and punctuation strings in the training corpus.
Leacock, Towell, and Voorhees (1996), comparing performance of the Bayesian classifier with a vector-space model used in information retrieval systems (Salton, Wong, and Yang 1975) and with a neural network, found that the neural networks had superior performance. Black (1988) trained on high-frequency local and topical context using a method based upon decision trees. While Black's results were encouraging, our attempts to use C4.5 (a decision-tree algorithm [Quinlan 1992]) on the topical encoding of line were uniformly disappointing (Leacock, Towell, and Voorhees 1993).</Paragraph> <Paragraph position="6"> The efficacy of our classifier is tested on three words, each a highly polysemous instance of a different part of speech: the noun line, the verb serve, and the adjective hard. The senses tested for each word are listed in Table 1. We restrict the test to senses within a single part of speech to focus the work on the harder part of the disambiguation problem--the accuracy of simple stochastic part-of-speech taggers such as Brill's (Brill 1992) suggests that distinguishing among senses with different parts of speech can readily be accomplished. The data set we use is identical to that of Leacock, Chodorow, and Miller (this volume) with two exceptions. First, we do not use part-of-speech tags. Second, we use exactly the same number of examples for each sense.</Paragraph> <Paragraph position="7"> To create data sets with an equal number of examples of each sense, we took the complete set of labeled examples for a word and randomly subsampled it so that all senses occurred equally often in our subsample. This meant that all examples of the least frequent sense appeared in every subsample. We repeated this procedure three times for each word. The same three subsamples were used in all of the experiments reported below. Analysis of variance studies have never detected a statistically significant difference between the subsamples.</Paragraph>
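The balanced subsampling just described can be sketched as follows. This is only an illustration of the procedure, with hypothetical function and variable names, not the code used in the experiments.

import random

def balanced_subsample(examples_by_sense, seed=0):
    """Randomly subsample so that every sense occurs as often as the least frequent sense."""
    rng = random.Random(seed)
    n = min(len(exs) for exs in examples_by_sense.values())   # size of the least frequent sense
    subsample = []
    for sense, examples in examples_by_sense.items():
        # every example of the least frequent sense is kept; other senses are sampled down to n
        chosen = examples if len(examples) == n else rng.sample(examples, n)
        subsample.extend((example, sense) for example in chosen)
    return subsample

# Repeating this with three different seeds gives three subsamples of the kind used below.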
<Paragraph position="8"> We used the same number of examples of each sense to eliminate any confounding effects of occurrence frequency. We do this because the frequency with which different senses occur in a corpus varies depending on the corpus type (the Wall Street Journal has many more instances of the 'product line' sense of line than other senses of line, for example) and can be difficult to estimate. Using an equal number of examples per sense makes the problem more challenging than it is likely to be in practice. For example, Yarowsky (1993) has demonstrated that exploiting frequency information can improve disambiguation accuracy. Indeed, if we had retained all examples of the 'product' sense of line from the Wall Street Journal, then we could have improved upon the results presented in the next section by simply always guessing 'product'.</Paragraph> <Paragraph position="9"> Table 1 Word senses used in this study.

serve - verb             hard - adjective        line - noun
supply with food         not easy (difficult)    product
hold an office           not soft (metaphoric)   phone
function as something    not soft (physical)     text
provide a service                                cord
                                                 division
                                                 formation

Number of Occurrences of the Least Frequent Sense
350                      350                     349</Paragraph> </Section> <Section position="4" start_page="128" end_page="130" type="metho"> <SectionTitle> 4. Neural-Network-based Sense Disambiguation </SectionTitle> <Paragraph position="0"> This section summarizes a series of experiments that tests whether neural networks can extract sufficient information from sample usages to accurately resolve word senses.</Paragraph> <Paragraph position="1"> We choose neural networks as the learning method for this study because our previous work has shown neural networks to be more effective than several other methods of sense disambiguation (Leacock, Towell, and Voorhees 1996). Moreover, there is ample empirical evidence indicating that neural networks are at least as effective as other learning systems on most problems (Shavlik, Mooney, and Towell 1991; Atlas et al. 1989). The major drawback to neural networks is that they may require a large amount of training time. For our purposes, training time is not an issue, since it may be done off-line. However, the time required to classify an example is significant. Because of its complexity, our approach will almost certainly be slower than methods such as decision trees. Still, the time to classify an example will most likely be dominated by the time required to transform an example into the appropriate format for input to the classifier. This time will be roughly uniform across classification strategies, so the difference in the speed of the various classification methods themselves should be unnoticeable.</Paragraph> <Paragraph position="2"> The first part of the section presents learning curves that plot accuracy versus the number of samples in the training set for each of: topical context only, local context only, and a combination of topical and local contexts. The curves demonstrate that the classifiers are able to distinguish among the different senses. Further, the results show that the combined classifier (i.e., a classifier that uses both topical and local contexts) is at least as good as, and is usually significantly better than, a classifier that uses only a single component. The second subsection is motivated by the observation that it is unlikely that a real-world training set will contain examples of all possible senses of a word. Hence, we investigate the effect on classification accuracy of senses that are missing from the labeled training set. For this investigation, we slightly modify the classification procedure to allow a do not know response. With this modification the method rejects unknown samples at rates significantly better than chance; this modification also tends to reject examples that would have been misclassified by the unmodified classifier. Since labeled training data are rare and expensive, the final subsection describes SULU, a method for learning accurate classifiers from a small number of labeled training examples plus a larger number of unlabeled examples.</Paragraph> <Paragraph position="3"> Experimental results demonstrate that SULU consistently and significantly improves classification accuracy when there are few labeled training examples.</Paragraph> <Section position="1" start_page="129" end_page="130" type="sub_section"> <SectionTitle> 4.1 Asymptotic Accuracy </SectionTitle> <Paragraph position="0"> All of the neural networks used here are strictly feed-forward (Rumelhart, Hinton, and Williams 1986). By this, we mean that there is a set of input units that receive activation only from outside the network. The input units pass their activation on to hidden units via weighted links.
The hidden units, in turn, pass information on to either additional hidden units or to output units. There are no recurrent links; that is, the activation sent by a unit can never, even through a series of intermediaries, be an input to that unit.</Paragraph> <Paragraph position="1"> Units that are not input units receive activation only via links. The non-input units compute the function y = 1/(1 + e^(-x)), where x is the sum of the incoming activations weighted by the links and y is the output activation. (The translation of words into numbers so that this formula can be applied to word sense disambiguation is described in the following paragraphs.) This nonlinear function has the effect of squashing the input into the range [0...1]. Output units give the answer for our networks. Finally, the activation of the output units is normalized so that their sum is 1.0.</Paragraph> <Paragraph position="2"> In all of the experiments reported below, the weights on all the links are initially set to random numbers taken from a uniform distribution over [-0.5...0.5]. The networks are then trained using gradient descent algorithms (e.g., backpropagation [Rumelhart, Hinton, and Williams 1986]) so that the activation of the output units is similar to some desired pattern. Networks are trained until either each example has been presented to the network 100 times or at least 99.5% of the training patterns are close enough to the desired pattern that they would be considered correct. (The meaning of &quot;correct&quot; will vary in our experiments; it will be clearly defined in each experiment.) In practice, the second stopping criterion always obtained.</Paragraph> <Paragraph position="3"> The networks used in most of this work have a very simple structure: one output unit per sense, one input unit per token (the meaning of &quot;token&quot; differs between local and topical networks as described below), and no hidden units. For both local and topical encodings, we tested many hidden unit structures, including ones with many layers and ones with large numbers of hidden units in a single layer. However, with one exception described below, a structure with no hidden units consistently yields the best results. Input units are completely connected to the output units; that is, every input unit is linked to every output unit. During training, the activation of the output unit corresponding to the correct sense has a target value of 1.0; the other outputs have a target value of 0.0. During testing, the sense reported by the network is the output unit with the largest activation. An example is considered to be classified correctly if the sense reported by the network is the same as the tagged sense.</Paragraph> <Paragraph position="4"> In networks that extract topical context, the number of input units is equal to the number of tokens that appear three or more times in the training set, where a token is the string remaining after text processing. The text processing includes removing capitalization, conflating words with common roots, and removing a set of high-frequency words called stopwords. To encode an example (an example is the sentence containing the target word and usually the preceding sentence) for the network, it is tokenized and the input units associated with the resulting tokens are set to 1.0 (regardless of the frequency with which the tokens appear in the example). All other input units are set to 0.0. We investigated many alternatives: both higher and lower bounds on the frequency of occurrence in the set of examples, including stopwords, and using frequency of occurrence rather than a simple presence indicator. None of these changes has a significant impact on accuracy.</Paragraph>
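To make the topical encoding and the no-hidden-unit network concrete, here is a minimal sketch in Python. It is our own illustration, not the authors' code: the whitespace tokenizer and lowercasing stand in for the unspecified text processing (stemming, stopword removal), and all function and class names are hypothetical.

import numpy as np

def build_topical_vocabulary(training_texts, stopwords, min_count=3):
    """One input unit per token that appears min_count or more times in the training set."""
    counts = {}
    for text in training_texts:
        for tok in text.lower().split():      # stand-in for the real tokenizer and stemmer
            if tok not in stopwords:
                counts[tok] = counts.get(tok, 0) + 1
    return {tok: i for i, tok in enumerate(t for t, c in counts.items() if c >= min_count)}

def encode_topical(text, vocab, stopwords):
    """Binary vector: 1.0 if the token occurs anywhere in the example, 0.0 otherwise."""
    x = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok not in stopwords and tok in vocab:
            x[vocab[tok]] = 1.0
    return x

class SingleLayerSenseNet:
    """Feed-forward network with no hidden units: every input unit linked to every output unit."""
    def __init__(self, n_inputs, n_senses, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.uniform(-0.5, 0.5, size=(n_senses, n_inputs))  # initial weights in [-0.5, 0.5]

    def forward(self, x):
        y = 1.0 / (1.0 + np.exp(-(self.W @ x)))   # logistic squashing into [0, 1]
        return y / y.sum()                        # normalize so the output activations sum to 1.0

    def predict(self, x):
        return int(np.argmax(self.forward(x)))    # reported sense = output unit with largest activation

Training would adjust W by gradient descent toward a target vector with 1.0 for the correct sense and 0.0 elsewhere, as described above.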
<Paragraph position="5"> Figure 1 The effect of increasing example size on the number of input units needed for encoding. For each of our three data sets and each encoding method, this figure shows the number of input units required to encode the examples. Except for the endpoints, which use the entire example set, each point is the average of 11 random selections from the population of examples.</Paragraph> <Paragraph position="6"> To encode an example for a network that extracts local context, each token (word or punctuation) is prepended with its position relative to the target, adding padding as necessary. For example, given the sentence &quot;John serves loyally.&quot;, the target serves, and a desire to use three tokens on either side of the target, the input to the network is [-3zzz -2zzz -1John 0serves 1loyally 2. 3zzz] where &quot;zzz&quot; is the padding string. Networks contain input units representing every such position-prefixed string within three tokens of the target word in the set of labeled training examples. Note that this implies that there will be words in positions in the test set that are not matched in the training set.</Paragraph> <Paragraph position="7"> So, while training examples will have exactly seven input units with a value of 1.0, testing examples will have at most seven input units with a value of 1.0. The window we use is slightly wider than a window of two words on either side that experiments with humans suggest is sufficient (Choueka and Lusignan 1985). The human study counted only words, whereas we count both words and punctuation. Our networks are significantly less accurate using windows smaller than three tokens on either side.</Paragraph> <Paragraph position="8"> On the other hand, wider windows are slightly, but not statistically significantly, more accurate.</Paragraph>
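A small sketch of this position-prefixed encoding follows; it is our own illustration with hypothetical names, and in the real system each distinct string produced this way gets its own input unit.

def encode_local_strings(tokens, target_index, window=3, pad="zzz"):
    """Prefix each token in the window with its signed offset from the target,
    padding with the dummy token where the example runs out."""
    strings = []
    for offset in range(-window, window + 1):
        i = target_index + offset
        tok = tokens[i] if 0 <= i < len(tokens) else pad
        strings.append(f"{offset}{tok}")
    return strings

# The example from the text, "John serves loyally ." with target "serves":
print(encode_local_strings(["John", "serves", "loyally", "."], target_index=1))
# ['-3zzz', '-2zzz', '-1John', '0serves', '1loyally', '2.', '3zzz']

An input vector for the local network then has a 1.0 for each of these strings that was seen during training, which is why a test example can activate fewer than seven input units.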
<Paragraph position="9"> Figure 1 shows that the topical and local encoding methods result in large input sets. For example, when the entire population of line examples is used, the local encoding would require 3,973 input units and the topical encoding would require 2,924 inputs. Fortunately, this figure shows that the rate of increase in the size of the input set steadily decreases as the size of the input set increases. Fitting each of the lines in this figure against exponential functions indicates that none of these data sets would grow to require more than 9,000 input units. While this is certainly large, it is tolerable.</Paragraph> <Paragraph position="10"> We investigated many ways of combining the output of the topical and local networks. We report results for a method that takes the maximum of the sum of the output units. 2</Paragraph> </Section> </Section> <Section position="5" start_page="130" end_page="142" type="metho"> <SectionTitle> 2 Among the many alternatives we investigated for merging the local and topical networks, only one yields slightly better results. </SectionTitle> <Paragraph position="0"> It is based upon Wolpert's stacked generalization (Wolpert 1992). In this technique, the outputs from the topical and local networks are passed into another network whose function is simply to learn how to combine the outputs. When the input to the combining network is the concatenation of the inputs and the outputs of both the local and topical networks, the combining network often outperforms our summing method. However, the improvement is usually not statistically significant, so we report only the results from the considerably simpler summing method.</Paragraph> <Paragraph position="1"> For example, suppose that a local network for disambiguating the three senses of hard has outputs of (0.4 0.5 0.1) and a topical network has outputs of (0.4 0.0 0.6). Then the local information would suggest the second sense, while the topical information would suggest the third sense. The summing strategy yields (0.8 0.5 0.7), so the combined classifier would select the first sense.</Paragraph>
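The summing combination is simple enough to state directly. The sketch below is ours, not the authors' code, and reproduces the hard example from the text.

import numpy as np

def combine_by_sum(local_outputs, topical_outputs):
    """Add the two networks' normalized output vectors and take the largest sum."""
    summed = np.asarray(local_outputs) + np.asarray(topical_outputs)
    return int(np.argmax(summed)), summed

sense, summed = combine_by_sum([0.4, 0.5, 0.1], [0.4, 0.0, 0.6])
print(summed)   # [0.8 0.5 0.7] -> the combined classifier selects the first sense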
<Paragraph position="2"> The only approach we have found that consistently, and statistically significantly, outperforms the strategy described above is based upon error-correcting output encoding (Kong and Dietterich 1995). The idea of error-correcting codes is to learn all possible dichotomies of the set of classifications. For example, given a problem with four classes, A, B, C, and D, learn to distinguish A and B from C and D; A from B, C, and D; etc. The major problem with this method is that it can be computationally intensive when there are many output classes, because there are 2^(S-1) - 1 dichotomies for S output classes. We implemented error-correcting output codes by independently training a network with one output unit and 10 hidden units to learn each dichotomy.</Paragraph> <Paragraph position="3"> We built learning curves using several values of N in N-fold cross-validation. In cross-validation, the data set is randomly divided into N equal parts, then N - 1 parts are used for training and the remaining part is held aside to assess generalization. 3 As a result, each example is used for training N - 1 times and once to assess generalization. A drawback of N-fold cross-validation is that it cannot test small portions of the data. So, for points on the learning curves that use less than 50% of the training data, we invert the cross-validation procedure, using one of the N parts for training and the remaining N - 1 parts to assess generalization. For example, if there are 100 labeled examples, each iteration of 10-fold cross-validation would use 90 examples for training and the remaining 10 for testing. When complete, each example would be used for training exactly nine times and exactly once for testing. By contrast, in inverted 10-fold cross-validation, each example is used exactly once for training and exactly nine times for testing.</Paragraph> <Paragraph position="4"> 3 By randomly separating examples, it is possible that examples taken from the same document appear in the training and testing sets. In principle, this could inflate the accuracy we report for the classifier. In practice, experiments that explicitly control for this effect do not yield significantly different results from those we report.</Paragraph> <Paragraph position="5"> Each point in the figure represents an average over 11 cross-validation trials. Thus, the point for 75% of the training set, which corresponds to 4-fold cross-validation, requires training 44 networks. The confusion matrices in Tables 2 to 4 give the complete data for the largest training sets of the &quot;standard&quot; curves in Figure 2. Rows in the table represent the correct answer, and columns represent the answer selected by the classifier. &quot;Total&quot; gives the number of times the classifier selects the given sense. &quot;Precision&quot; is the percentage of that total that is correct. In contrast, &quot;Percent Correct&quot; gives the accuracy of the classifier over the set of hand-tagged examples of the given sense.</Paragraph> <Paragraph position="6"> Table 2 Average confusion matrices for hard over 11 runs of 10-fold cross-validation. Rows in the table represent the correct answer, columns represent the answer given by the classifier. So, in the local table, 22.1 is the average number of times that the classifier selected the 'physical' sense when the correct sense was the 'difficult' sense. There are 350 examples of each class.</Paragraph> <Paragraph position="7"> Figure 2 Learning curves for classifiers that use local context only (Local), topical context only (Topical), and a combination of local and topical contexts (Combined) for hard, serve, and line. Each point in each curve represents an average over 11 repetitions of N-fold cross-validation. The points on each of these curves represent 10-, 6-, 4-, 3-, and 2-fold cross-validation and 3-, 4-, and 6-fold inverted cross-validation. Error-correcting codes have results at only 2-, 3-, and 4-fold cross-validation (i.e., 50%, 66%, and 75% of the training data).</Paragraph> <Paragraph position="8"> Figure 2 shows that, for line and serve, the combined classifier is considerably superior to either the local or topical classifier at all training-set sizes. At the largest training-set size, the combined classifier is superior with at least 99.5% confidence according to a one-tailed paired-sample t-test. There is no advantage for the combined classifier for hard. In fact, at the largest training-set size, the local classifier slightly outperforms the combined classifier. The difference, while small, is statistically significant with 97.5% confidence according to a one-tailed paired-sample t-test.</Paragraph> <Paragraph position="9"> An obvious reason, as can be seen in Figure 2, for why the combined representation fails to improve classification effectiveness for hard is that the topical classifier is much worse than the local classifier. While the differences in accuracy between the topical and local classifiers are statistically significant with at least 99.5% confidence according to a one-tailed paired-sample t-test on all three words, the accuracies are more similar for both serve and line than they are for hard. This obvious difference in accuracy is only part of the reason why the combined classifier is less effective for hard. Perhaps more important is the fact that the errors for the local and topical classifiers are more highly correlated for hard than for either line or serve. The average correlation of correct and incorrect answers for local and topical classifiers of hard is 0.14, while the average correlations for line and serve, respectively, are 0.07 and -0.02. Many efforts at using ensembles of classifiers have reported that to get significant improvements, the members of the ensemble should be as uncorrelated as possible (Parmanto, Munro, and Doyle 1996). Given the correlation between the local and topical classifiers for hard, it is not surprising that the combined classifier provides no additional benefit.</Paragraph>
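The correlations quoted above can be read as correlations between the two classifiers' per-example correct/incorrect indicators. The sketch below is our interpretation of that measure, not necessarily the exact computation used here.

import numpy as np

def error_correlation(local_predictions, topical_predictions, gold_senses):
    """Correlate the two classifiers' correctness indicators over the same test examples."""
    local_ok = (np.asarray(local_predictions) == np.asarray(gold_senses)).astype(float)
    topical_ok = (np.asarray(topical_predictions) == np.asarray(gold_senses)).astype(float)
    return float(np.corrcoef(local_ok, topical_ok)[0, 1])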
The confusion matrices in Tables 2 to 4 provide another piece of evidence about the failure of the combined representation for hard. Consider the pattern of responses for the 'provide food' sense of serve by the local and topical classifiers (Table 3). In particular, notice that the local classifier is much more likely to select the 'provide food' sense (26.1) when the correct sense is 'provide a service' than it is when the correct sense is 'function as' (6.7). Conversely, the topical classifier is more likely to select the 'provide food' sense when the correct sense is 'function as' (30.6) than when the correct sense is 'provide a service' (5.8). When these tendencies are combined they essentially offset each other. As a result, the combined classifier is unlikely to select the 'provide food' sense when the correct sense is either 'function as' (5.8) or 'provide a service' (5.7).</Paragraph> <Paragraph position="10"> Table 3 Average confusion matrices for serve over 11 runs of 10-fold cross-validation. Rows in the table represent the correct answer, columns represent the answer given by the classifier. So, in the local table, 6.7 is the average number of times that the classifier selected the 'food' sense when the correct sense was the 'function as' sense. There are 350 examples of each class.</Paragraph> <Paragraph position="11"> Local-serve      function as   service   food   office   Percent Correct
function as      293.7         22.9      6.7    26.6     83.9%</Paragraph> <Paragraph position="12"> This pattern of offsetting errors is repeated on all but one of the senses of line and serve. Wherever it occurs, the combined classifier is superior to both the local and topical classifiers. By contrast, errors for the local and topical classifiers of hard (Table 2) rarely offset each other. Only for the 'physical' sense do the errors offset, and that is the only sense on which the combined classifier outperforms the local classifier. For the 'difficult' and 'metaphoric' senses of hard, the erroneous selections made by the topical classifier are strictly greater than those of the local classifier. Similarly, the selection of the local classifier for the 'cord' sense of line is strictly worse than the selections of the topical classifier. In each of these cases, the effect of the combined classifier is to roughly average the errors of the local and topical classifiers, with the result that the combined classifier has more errors than the better of the local and topical classifiers. Unfortunately, the number of errors introduced by the 'difficult' and 'metaphoric' senses of hard more than offset the errors eliminated by the 'physical' sense. Therefore, the local classifier for hard is more accurate than the combined classifier.</Paragraph> <Paragraph position="13"> The topical classifier outperforms the local classifier for the noun line (Figure 2). Conversely, the local classifier outperforms the topical classifier for the verb serve and the adjective hard. While we hesitate to draw many conclusions from this pattern on the basis of so little data, the pattern is consistent with other observations. Yarowsky (1993) suggests that the sense of an adjective is almost wholly determined by the noun it modifies. If this suggestion is correct, then the added information in the topical representation should add only confusion.
Hence, one would expect to see the local classifier outperforming the topical classifier for all adjectives. Similarly, some verb senses are determined largely by their direct object. For example, the 'provide a service' sense of serve almost always has a thing as a direct object, while the 'function as' sense of serve almost always has a person. The added information in the topical encoding may obscure this difference, thereby adding to the difficulty of correctly disambiguating these senses. So, we would not be surprised to see the advantage of local representations over topical representations continue on other verbs and adjectives.</Paragraph> <Paragraph position="14"> Many techniques for using local context explicitly use diagnostic phrases, such as wait in line, for the formation sense of line. In previous work, we took exactly this approach and showed that diagnostic phrases could be used to improve the accuracy of a topical classifier (Leacock, Towell, and Voorhees 1996). Our neural network for local disambiguation differs considerably from this approach. Specifically, it is unable to learn more than one diagnostic phrase per sense because it lacks hidden units.</Paragraph> <Paragraph position="15"> In fact, the network does not learn a single diagnostic phrase. Instead, it learns that certain words in certain positions are indicative of certain senses. While this might appear to be a significant handicap, we have been unable to train a network that is capable of learning phrases so that it outperforms our networks. In addition, while they lack the ability to learn phrases, our local classifiers are, nonetheless, quite effective at determining the correct sense. It is our belief that hidden units would be useful for learning local context given a sufficient amount of training data. However, there are currently far more free parameters in our networks than there are examples to constrain those parameters. Until there are more constraints, we do not believe that hidden units will be useful for sense disambiguation.</Paragraph> <Paragraph position="16"> Finally, it is interesting to note that not all senses are equally easy, and that different classifiers find different senses easier than others. For example, in Table 2 the most difficult sense of hard for the local classifier is the 'physical' sense, but this is the easiest sense for both the topical and combined classifiers. On the other hand, some senses are just difficult. The 'text' sense of line (Table 4) is among the hardest for all classifiers. We believe that the 'text' sense is difficult because it often contains quoted material which may distract from the meaning of line. However, the quoted material is often too far away from the target word for the quotation marks to be seen in the local window. As a result, the topical classifier is confused by distracting material and the local classifier does not see the most salient feature.</Paragraph> <Section position="1" start_page="136" end_page="138" type="sub_section"> <SectionTitle> 4.2 Senses Missing from Data </SectionTitle> <Paragraph position="0"> The results in the previous section suggest that, given a sufficiently large number of labeled examples, it is possible to combine topical and local representations into an effective sense classifier. Those results, however, assume that the labeled examples include all possible senses of the word to be disambiguated. 
Senses not included in the training set will be misclassified because the procedure assigns a sense to every example. In this section, we allow the system to respond do not know to address the issue of senses not seen during training.</Paragraph> <Paragraph position="1"> In the previous section, the sense selected by the network is the sense corresponding to the output unit with the largest activation. If the output units are known to represent all possible senses, then this is a reasonable procedure. If, however, there is reason to believe that there may be other senses, then this procedure imparts a strong, and incorrect, bias to the classification step. When there is reason to believe that there are senses in the data that are not represented in the training set, we can relax this bias by using the sense selected by the largest output activation only if that activation is greater than a threshold. When the maximum activation is below the threshold, the network's response is do not know.</Paragraph> <Paragraph position="2"> The logic underlying this modification is that the activation of the output unit corresponding to the correct answer tends to be close to 1.0 when the instance to be classified is similar to a training example. Hence, instances of senses seen during training should have an output unit whose activation is close to 1.0 (assuming that the training examples adequately represent the set of possibilities). On the other hand, instances of senses not seen during training are unlikely to be similar to any training example. So, they are unlikely to generate an activation that is close to 1.0.</Paragraph> <Paragraph position="3"> We designed an experiment to test the hypothesis that we can detect unknown senses by screening for examples that have a low maximum output activation. Our procedure is as follows: networks are trained using 90% of the examples of S - 1 senses when there are S senses for a target word. The trained network is then tested using the unused 10% of the S - 1 classes seen during training and 10% of the examples of the class not seen during training (selected randomly). In addition, during testing the network is given a threshold value to determine whether or not to label the example. Figure 3 shows the effect of varying the threshold from 0.4 to 1.0 (values below 0.4 were tried but had no effect) using the combined classifier. The leave-one-category-out procedure was repeated 11 times for each sense.</Paragraph> <Paragraph position="4"> Figure 3 shows that the modification of the classifier has the hypothesized effect. Not surprisingly, the number of examples classified always decreases as the threshold increases. Also expected is that the percentage of correctly rejected examples falls as the threshold increases--increasing the threshold naturally catches more examples that should be accepted. (A rejected example is one for which the classifier responds do not know.) The up-tick in the proper rejection rate at high thresholds for line is not significant. Of more interest is that the classifier is always significantly better than chance at correctly rejecting examples. The chance rate of correct rejection is shown in Table 5. Thus, the modification allows the classifier to identify senses that do not appear in the training set.</Paragraph> <Paragraph position="5"> Figure 3 also shows that the threshold has the unanticipated benefit of rejecting misclassified examples of known senses. Hence, it may be desirable to use a threshold even when all senses of a word are represented in the training set. The exact level of the threshold is a matter of choice: a low threshold admits more errors but rejects fewer examples, while higher thresholds are more accurate but classify fewer examples.</Paragraph>
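The thresholded decision rule itself is compact; the following sketch (ours, with an arbitrary illustrative threshold value) shows the do not know behavior.

import numpy as np

def classify_with_threshold(output_activations, threshold=0.9):
    """Return the index of the winning sense, or None ('do not know')
    when the largest normalized activation does not clear the threshold."""
    outputs = np.asarray(output_activations)
    best = int(np.argmax(outputs))
    return best if outputs[best] >= threshold else None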
Figure 3 The effect of omitting one sense from the training set. In each figure, the X-axis represents the level of a threshold. If the maximum output activation is below the threshold then the network responds do not know. &quot;Correct (only known senses)&quot; gives the accuracy of the combined classifier on senses seen during training. &quot;Correct (all)&quot; gives the accuracy over all examples. &quot;Properly rejected&quot; is the percentage of all examples for which the classifier responds do not know that are either in a novel sense or would have been misclassified. Finally, &quot;Classified&quot; gives the percentage of the data for which the classifier assigns a sense.</Paragraph> </Section> <Section position="2" start_page="138" end_page="142" type="sub_section"> <SectionTitle> 4.3 Using Small Amounts of Labeled Data </SectionTitle> <Paragraph position="0"> All of the above results have assumed that there exist a large number of hand-labeled examples to use during training. Unfortunately, this is not likely to be the case. Rather than working with a number of labeled examples sufficient to approach an asymptotic level of accuracy, the classifiers are likely to be working with a number of labeled examples barely sufficient to get them started on the learning curve. While labeled examples will likely always be rare, unlabeled text is already available in huge quantities. Theoretical results (Castelli and Cover 1995) suggest that it should be possible to use both labeled and unlabeled examples to produce a classifier that is more accurate than one based on only labeled examples. We describe an algorithm, SULU (Supervised learning Using Labeled and Unlabeled examples), that uses both labeled and unlabeled examples, and provide empirical evidence of the algorithm's effectiveness (Towell 1996).</Paragraph> <Paragraph position="1"> SULU follows standard neural-network training techniques except that it may replace a labeled example with a synthetic example. A synthetic example is a point constructed from the nearest neighbors of a labeled example. The criterion to stop training in SULU is also slightly modified to require that the network correctly classify almost every labeled example and a majority of the synthetic examples. For instance, the experiments reported below generate synthetic examples 50% of the time; the stopping criterion requires that 80% of the examples seen in a single pass through the training set (an epoch) are classified correctly. Figure 4 shows pseudocode for the SULU algorithm. The synthesize function describes the process through which an example is synthesized.</Paragraph> <Paragraph position="2"> Figure 4 Pseudocode for SULU.

Repeat
  For each e in E
    if random(0, 100) > B then
      e <- SYNTHESIZE(e, E, U, random(2, M))
    TRAIN N using e
Until a stopping criterion is reached

SYNTHESIZE(e, E, U, m):
  Let C    /* will hold a collection of examples */
  For i from 1 to m
    c <- ith nearest neighbor of e in E union U
    if ((c is labeled) and (label of c not equal to label of e)) then STOP
    if c is not labeled
      cc <- nearest neighbor of c in E
      if label of cc not equal to label of e then STOP
    add c to C
  return an example whose input is the centroid of the inputs of the examples in C
  and that has the class label of e.</Paragraph> <Paragraph position="3"> Given a labeled example to use as a seed, synthesize collects neighboring examples and returns an example that is the centroid of the collected examples with the label of the starting point. Synthesize collects neighboring examples until reaching one of the following three stopping points. First, the maximum number of points is reached: the goal of SULU is to get information about the local variance around known points, and this criterion guarantees locality. Second, the next closest example to the seed is a labeled example with a different label: this criterion prevents the inclusion of obviously incorrect information in synthetic examples. Third, the next closest example to the seed is an unlabeled example and the closest labeled example to that unlabeled example has a different label from the seed: this criterion is intended to detect borders between classification areas in example space.</Paragraph>
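Read as Python, the SYNTHESIZE routine might look roughly like the sketch below. This is our rendering only: the Euclidean distance, the handling of the seed point, and all names are assumptions not fixed by the pseudocode.

import numpy as np

def neighbors_by_distance(x, pool):
    """Indices into pool (a list of (vector, label) pairs; label may be None) sorted by distance to x."""
    return np.argsort([np.linalg.norm(x - vec) for vec, _ in pool])

def synthesize(seed_x, seed_label, labeled, unlabeled, m):
    """Collect up to m neighbors of the seed, stop at evidence of a class border,
    and return the centroid of the collected inputs with the seed's label."""
    pool = list(labeled) + [(x, None) for x in unlabeled]
    collected = [seed_x]     # the seed is included in the centroid; the pseudocode leaves this open
    taken = 0
    for idx in neighbors_by_distance(seed_x, pool):
        x, y = pool[idx]
        if np.array_equal(x, seed_x):
            continue                             # skip the seed itself
        if y is not None and y != seed_label:
            break                                # labeled neighbor with a different label
        if y is None:
            nn = neighbors_by_distance(x, labeled)[0]
            if labeled[nn][1] != seed_label:
                break                            # closest labeled example to this unlabeled point disagrees
        collected.append(x)
        taken += 1
        if taken == m:
            break                                # locality: at most m neighbors
    return np.mean(collected, axis=0), seed_label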
<Paragraph position="4"> For each data set, the experimental procedure is as follows. First, the data are split into three sets: 25% is set aside to be used for assessing generalization, 50% is stripped of sense labels, and the remaining 25% is used for training. To create learning curves, the training set is further subdivided into sets of 5%, 10%, 15%, 20%, and 25% of the data, such that smaller sets are always subsets of larger sets. Then, a single neural network (of the structure described in Section 4.1) is created and copied 25 times. At each training-set size, a new copy of the network is trained under each of the following conditions: (1) using SULU, (2) using SULU but supplying only the labeled training examples to synthesize, (3) standard network training, (4) using a re-implementation of an algorithm proposed by Yarowsky (1995), and (5) using standard network training but with all training examples labeled to establish an upper bound. This procedure is repeated 11 times to average out the effects of example selection and network initialization.</Paragraph>
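A rough sketch of that split follows; it is ours, and the exact shuffling, rounding, and bookkeeping are not specified beyond what the text says.

import random

def split_for_sulu(examples, seed=0):
    """25% held out for testing, 50% stripped of labels, 25% kept as labeled training data,
    with nested training subsets of 5% to 25% of the whole data set."""
    rng = random.Random(seed)
    shuffled = list(examples)            # examples are (input_vector, sense_label) pairs
    rng.shuffle(shuffled)
    n = len(shuffled)
    test = shuffled[: n // 4]
    unlabeled = [x for x, _ in shuffled[n // 4 : 3 * n // 4]]   # labels discarded
    train = shuffled[3 * n // 4 :]
    fractions = [0.05, 0.10, 0.15, 0.20, 0.25]
    train_subsets = [train[: min(len(train), max(1, int(f * n)))] for f in fractions]
    return test, unlabeled, train_subsets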
<Paragraph position="5"> Yarowsky's algorithm expands the region of known, labeled examples out from a small set of hand-labeled seed collocations. Our instantiation of Yarowsky's algorithm differs from the original in three ways. First, we use neural networks whereas Yarowsky uses decision lists. This difference is almost certainly not significant; in describing his algorithm, Yarowsky notes that a neural network could be used in place of decision lists. Second, we omit the application of the one-sense-per-discourse heuristic, as our examples are not part of a larger discourse. This heuristic could equally be applied to SULU, so eliminating it from Yarowsky's algorithm places the algorithms on an equal footing. Finally, we randomly pick the initially labeled contexts. The effect of this difference could be significant. However, this difference would affect our system as well as Yarowsky's, so it should not invalidate our comparison. When SULU is used, synthetic examples replace labeled examples 50% of the time. Networks using the full SULU (condition 1 above) are trained until 80% of the examples in a single epoch are correctly classified. All other networks are trained until at least 99.5% of the examples are correctly classified.</Paragraph> <Paragraph position="6"> Figure 5 shows the performance of the combined classifier for each algorithm on each of our three target words. SULU always results in a statistically significant improvement over the standard neural network with at least 97.5% confidence (according to a one-tailed paired-sample t-test). Interestingly, SULU's improvement is consistently between 1/4 and 1/2 of that achieved by labeling the unlabeled examples. This result contrasts with Castelli and Cover's (1995) analysis, which suggests that labeled examples are exponentially more valuable than unlabeled examples.</Paragraph> <Paragraph position="7"> Figure 5 The effect of five training procedures on the target words. In each of the above graphs, the effect of standard neural learning has been subtracted from all results to suppress the increase in accuracy that results simply from an increase in the number of labeled training examples. Observations marked by a &quot;o&quot; or a &quot;+&quot;, respectively, indicate that the point is statistically significantly inferior or superior to a network trained using SULU.</Paragraph> <Paragraph position="8"> SULU is consistently and significantly superior to our version of Yarowsky's algorithm when there are few labeled examples. As the number of labeled examples increases, the advantage of SULU decreases. At the largest training-set sizes tested, the two systems are roughly equally effective.</Paragraph> <Paragraph position="9"> A possible criticism of SULU is that it does not actually need the unlabeled examples; the procedure may be as effective using only the labeled training data. This hypothesis is incorrect. As shown in Figure 5, when SULU is given no unlabeled examples it is consistently and significantly inferior to SULU when it is given a large number of unlabeled examples. In addition, SULU with no unlabeled examples is consistently, although not always significantly, inferior to a standard neural network (data not shown).</Paragraph> <Paragraph position="10"> An indication that there is room for improvement in SULU is the difference in generalization between SULU and a network trained using data in which the unlabeled examples provided to SULU have labels (condition 5 above). On every data set, the gain from labeling the examples is statistically significant. The accuracy of a network trained with all labeled examples is an upper bound for SULU, and one that is likely not reachable. However, the distance between this upper bound and SULU's current performance indicates that there is room for improvement.</Paragraph> </Section> </Section> </Paper>