XML Viewer - a00-2009

File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/a00-2009_metho.xml
Size: 23,250 bytes
Last Modified: 2025-10-06 14:07:01
<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2009">
  <Title>A Simple Approach to Building Ensembles of Naive Bayesian Classifiers for Word Sense Disambiguation</Title>
  <Section position="3" start_page="0" end_page="63" type="metho">
    <SectionTitle>
2 Naive Bayesian Classifiers
</SectionTitle>
    <Paragraph position="0"> A Naive Bayesian classifier assumes that all the feature variables representing a problem are conditionally independent given the value of a classification variable. In word sense disambiguation, the context in which an ambiguous word occurs is represented by the feature variables (F1, F2,..., F~) and the sense of the ambiguous word is represented by the classification variable (S). In this paper, all feature variables Fi are binary and represent whether or not a particular word occurs within some number of words to the left or right of an ambiguous word, i.e., a win- null dow of context. For a Naive Bayesian classifier, the joint probability of observing a certain combination of contextual features with a particular sense is expressed as:</Paragraph>
    <Paragraph position="2"> The parameters of this model are p(S) and p(Fi\]S). The sufficient statistics, i.e., the summaries of the data needed for parameter estimation, are the frequency counts of the events described by the interdependent variables (Fi, S). In this paper, these counts are the number of sentences in the sense-tagged text where the word represented by Fi occurs within some specified window of context of the ambiguous word when it is used in sense S.</Paragraph>
    <Paragraph position="3"> Any parameter that has a value of zero indicates that the associated word never occurs with the specified sense value. These zero values are smoothed by assigning them a very small default probability.</Paragraph>
    <Paragraph position="4"> Once all the parameters have been estimated, the model has been trained and can be used as a classifier to perform disambiguation by determining the most probable sense for an ambiguous word, given the context in which it occurs.</Paragraph>
    <Section position="1" start_page="63" end_page="63" type="sub_section">
      <SectionTitle>
2.1 Representation of Context
</SectionTitle>
      <Paragraph position="0"> The contextual features used in this paper are binary and indicate if a given word occurs within some number of words to the left or right of the ambiguous word. No additional positional information is contained in these features; they simply indicate if the word occurs within some number of surrounding words.</Paragraph>
      <Paragraph position="1"> Punctuation and capitalization are removed from the windows of context. All other lexical items are included in their original form; no stemming is performed and non-content words remain.</Paragraph>
      <Paragraph position="2"> This representation of context is a variation on the bag-of-words feature set, where a single window of context includes words that occur to both the left and right of the ambiguous word. An early use of this representation is described in (Gale et al., 1992), where word sense disambiguation is performed with a Naive Bayesian classifier. The work in this paper differs in that there are two windows of context, one representing words that occur to the left of the ambiguous word and another for those to the right.</Paragraph>
    </Section>
    <Section position="2" start_page="63" end_page="63" type="sub_section">
      <SectionTitle>
2.2 Ensembles of Naive Bayesian Classifiers
</SectionTitle>
      <Paragraph position="0"> The left and right windows of context have nine different sizes; 0, 1, 2, 3, 4, 5, 10, 25, and 50 words.</Paragraph>
      <Paragraph position="1"> The first step in the ensemble approach is to train a separate Naive Bayesian classifier for each of the 81 possible combination of left and right window sizes.</Paragraph>
      <Paragraph position="2"> Naive_Bayes (1,r) represents a classifier where the model parameters have been estimated based on Irequency counts of shallow lexical features from two windows of context; one including I words to the left of the ambiguous word and the other including r words to the right. Note that Naive_Bayes (0,0) includes no words to the left or right; this classifier acts as a majority classifier that assigns every instance of an ambiguous word to the most frequent sense in the training data. Once the individual classifiers are trained they are evaluated using previously held-out test data.</Paragraph>
      <Paragraph position="3"> The crucial step in building an ensemble is selecting the classifiers to include as members. The approach here is to group the 81 Naive Bayesian classifiers into general categories representing the sizes of the windows of context. There are three such ranges; narrow corresponds to windows 0, 1 and 2 words wide, medium to windows 3, 4, and 5 words wide, and wide to windows 10, 25, and 50 words wide. There are nine possible range categories since there are separate left and right windows. For example, Naive_Bayes(1,3) belongs to the range category (narrow, medium) since it is based on a one word window to the left and a three word window to the right. The most accurate classifier in each of the nine range categories is selected for inclusion in the ensemble. Each of the nine member classifiers votes for the most probable sense given the particular context represented by that classifier; the ensemble disambiguates by assigning the sense that receives a majority of the votes.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="63" end_page="64" type="metho">
    <SectionTitle>
3 Experimental Data
</SectionTitle>
    <Paragraph position="0"> The line data was created by (Leacock et al., 1993) by tagging every occurrence of line in the ACL/DCI Wall Street Journal corpus and the American Printing House for the Blind corpus with one of six possible WordNet senses. These senses and their frequency distribution are shown in Table 1. This data has since been used in studies by (Mooney, 1996), (Towell and Voorhees, 1998), and (Leacock et al., 1998). In that work, as well as in this paper, a subset of the corpus is utilized such that each sense is uniformly distributed; this reduces the accuracy of the majority classifier to 17%. The uniform distribution is created by randomly sampling 349 sense-tagged examples from each sense, resulting in a training corpus of 2094 sense-tagged sentences.</Paragraph>
    <Paragraph position="1"> The interest data was created by (Bruce and Wiebe, 1994) by tagging all occurrences of interest in the ACL/DCI Wall Street Journal corpus with senses from the Longman Dictionary of Contemporary English. This data set was subsequently used for word sense disambiguation experiments by (Ng and Lee, 1996), (Pedersen et al., 1997), and (Pedersen and Bruce, 1997). The previous studies and this paper use the entire 2,368 sense-tagged sentence corpus in their experiments. The senses and their ire- null sense count product 2218 written or spoken text 405 telephone connection 429 formation of people or things; queue 349 an artificial division; boundary 376 a thin, flexible object; cord 371 total 4148  iments in this paper and previous work use a uniformly distributed subset of this corpus, where each sense occurs 349 times.</Paragraph>
    <Paragraph position="2"> sense count money paid for the use of money 1252 a share in a company or business 500 readiness to give attention 361 advantage, advancement or favor 178 activity that one gives attention to 66 causing attention to be given to 11 total 2368  periments in this paper and previous work use the entire corpus, where each sense occurs the number of times shown above.</Paragraph>
    <Paragraph position="3"> quency distribution are shown in Table 2. Unlike line, the sense distribution is skewed; the majority sense occurs in 53% of the sentences, while the smallest minority sense occurs in less than 1%.</Paragraph>
  </Section>
  <Section position="5" start_page="64" end_page="66" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> Eighty-one Naive Bayesian classifiers were trained and tested with the line and interest data. Five-fold cross validation was employed; all of the sense-tagged examples for a word were randomly shuffled and divided into five equal folds. Four folds were used to train the Naive Bayesian classifier while the remaining fold was randomly divided into two equal sized test sets. The first, devtest, was used to evaluate the individual classifiers for inclusion in the ensemble. The second, test, was used to evaluate the accuracy of the ensemble. Thus the training data for each word consists of 80% of the available sense-tagged text, while each of the test sets contains 10%. This process is repeated five times so that each fold serves as the source of the test data once. The average accuracy of the individual Naive Bayesian classifiers across the five folds is reported in Tables 3 and 4. The standard deviations were between .01 and .025 and are not shown given their relative consistency. null  Each classifier is based upon a distinct representation of context since each employs a different combination of right and left window sizes. The size and range of the left window of context is indicated along the horizontal margin in Tables 3 and 4 while the right window size and range is shown along the vertical margin. Thus, the boxes that subdivide each table correspond to a particular range category. The classifier that achieves the highest accuracy in each range category is included as a member of the ensemble. In case of a tie, the classifier with the smallest total window of context is included in the ensemble.</Paragraph>
    <Paragraph position="1"> The most accurate single classifier for line is Naive_Bayes (4,25), which attains accuracy of 84% The accuracy of the ensemble created from the most accurate classifier in each of the range categories is 88%. The single most accurate classifier for interest is Naive._Bayes(4,1), which attains accuracy of 86% while the ensemble approach reaches 89%. The increase in accuracy achieved by both ensembles over the best individual classifier is statistically significant, as judged by McNemar's test with p = .01.</Paragraph>
    <Section position="1" start_page="65" end_page="66" type="sub_section">
      <SectionTitle>
4.1 Comparison to Previous Results
</SectionTitle>
      <Paragraph position="0"> These experiments use the same sense-tagged corpora for interest and line as previous studies. Summaries of previous results in Tables 5 and 6 show that the accuracy of the Naive Bayesian ensemble is comparable to that of any other approach. However, due to variations in experimental methodologies, it can not be concluded that the differences among the most accurate methods are statistically significant. For example, in this work five-fold cross validation is employed to assess accuracy while (Ng and Lee, 1996) train and test using 100 randomly sampled sets of data. Similar differences in training and testing methodology exist among the other studies. Still, the results in this paper are encouraging due to the simplicity of the approach.</Paragraph>
      <Paragraph position="1">  The interest data was first studied by (Bruce and Wiebe, 1994). They employ a representation of context that includes the part-of-speech of the two words surrounding interest, a morphological feature indicating whether or not interest is singular or plural, and the three most statistically significant co-occurring words in the sentence with interest, as determined by a test of independence. These features are abbreviated as p-o-s, morph, and co-occur in Table 5. A decomposable probabilistic model is induced from the sense-tagged corpora using a backward sequential search where candidate models are evaluated with the log-likelihood ratio test. The selected model was used as a probabilistic classifier on a held-out set of test data and achieved accuracy of 78%.</Paragraph>
      <Paragraph position="2"> The interest data was included in a study by (Ng  accuracies are associated with the classifiers included in the ensemble, which attained accuracy of 88% when evaluated with the test data.</Paragraph>
      <Paragraph position="3">  accuracies are associated with the classifiers included in the ensemble, which attained accuracy of 89% when evaluated with the test data.</Paragraph>
      <Paragraph position="4"> and Lee, 1996), who represent the context of an ambiguous word with the part-of-speech of three words to the left and right of interest, a morphological feature indicating if interest is singular or plural, an unordered set of frequently occurring keywords that surround interest, local collocations that include interest, and verb-object syntactic relationships. These features are abbreviated p-o-s, morph, co-occur, collocates, and verb-obj in Table 5. A nearest-neighbor classifier was employed and achieved an average accuracy of 87% over repeated trials using randomly drawn training and test sets.</Paragraph>
      <Paragraph position="5"> (Pedersen et al., 1997) and (Pedersen and Bruce, 1997) present studies that utilize the original Bruce and Wiebe feature set and include the interest data.</Paragraph>
      <Paragraph position="6"> The first compares a range of probabilistic model selection methodologies and finds that none outperform the Naive Bayesian classifier, which attains accuracy of 74%. The second compares a range of machine learning algorithms and finds that a decision tree learner (78%) and a Naive Bayesian classifier (74%) are most accurate.</Paragraph>
      <Paragraph position="7">  The line data was first studied by (Leacock et al., 1993). They evaluate the disambiguation accuracy of a Naive Bayesian classifier, a content vector, and a neural network. The context of an ambiguous word is represented by a bag-of-words where the window of context is two sentences wide. This feature set is abbreviated as 2 sentence b-o-w in Table 6. When the Naive Bayesian classifier is evaluated words are not stemmed and capitalization remains.</Paragraph>
      <Paragraph position="8"> However, with the content vector and the neural network words are stemmed and words from a stop-list are removed. They report no significant differences in accuracy among the three approaches; the Naive Bayesian classifier achieved 71% accuracy, the content vector 72%, and the neural network 76%.</Paragraph>
      <Paragraph position="9"> The line data was studied again by (Mooney, 1996), where seven different machine learning methodologies are compared. All learning algorithms represent the context of an ambiguous word using the bag-of-words with a two sentence window of context. In these experiments words from a stop- null are stemmed. The two most accurate methods in this study proved to be a Naive Bayesian classifier (72%) and a perceptron (71%).</Paragraph>
      <Paragraph position="10"> The line data was recently revisited by both (Towell and Voorhees, 1998) and (Leacock et al., 1998). The former take an ensemble approach where the output from two neural networks is combined; one network is based on a representation of local context while the other represents topical context. The latter utilize a Naive Bayesian classifier. In both cases context is represented by a set of topical and local features. The topical features correspond to the open-class words that occur in a two sentence window of context. The local features occur within a window of context three words to the left and right of the ambiguous word and include co-occurrence features as well as the part-of-speech of words in this window. These features are represented as local &amp; topical b-o-w and p-o-s in Table 6. (Towell and Voorhees, 1998) report accuracy of 87% while (Leacock et al., 1998) report accuracy of 84%.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="66" end_page="67" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> The word sense disambiguation ensembles in this paper have the following characteristics:  extracted from varying sized windows of surrounding words, * member classifiers are selected for the ensembles based on their performance relative to others with comparable window sizes, and * a majority vote of the member classifiers determines the outcome of the ensemble.</Paragraph>
    <Paragraph position="1"> Each point is discussed below.</Paragraph>
    <Section position="1" start_page="66" end_page="67" type="sub_section">
      <SectionTitle>
5.1 Naive Bayesian classifiers
</SectionTitle>
      <Paragraph position="0"> The Naive Bayesian classifier has emerged as a consistently strong performer in a wide range of comparative studies of machine learning methodologies.</Paragraph>
      <Paragraph position="1"> A recent survey of such results, as well as possible explanations for its success, is presented in (Domingos and Pazzani, 1997). A similar finding has emerged in word sense disambiguation, where a number of comparative studies have all reported that no method achieves significantly greater accuracy than the Naive Bayesian classifier (e.g., (Leacock et al., 1993), (Mooney, 1996), (Ng and Lee, 1996), (Pedersen and Bruce, 1997)).</Paragraph>
      <Paragraph position="2"> In many ensemble approaches the member classitiers are learned with different algorithms that are trained with the same data. For example, an ensemble could consist of a decision tree, a neural network, and a nearest neighbor classifier, all of which are learned from exactly the same set of training data. This paper takes a different approach, where the learning algorithm is the same for all classifiers  but the training data is different. This is motivated by the belief that there is more to be gained by varying the representation of context than there is from using many different learning algorithms on the same data. This is especially true in this domain since the Naive Bayesian classifier has a history of success and since there is no generally agreed upon set of features that have been shown to be optimal for word sense disambiguation.</Paragraph>
    </Section>
    <Section position="2" start_page="67" end_page="67" type="sub_section">
      <SectionTitle>
5.2 Co-occurrence features
</SectionTitle>
      <Paragraph position="0"> Shallow lexical features such as co-occurrences and collocations are recognized as potent sources of disambiguation information. While many other contextual features are often employed, it isn't clear that they offer substantial advantages. For example, (Ng and Lee, 1996) report that local collocations alone achieve 80% accuracy disambiguating interest, while their full set of features result in 87%. Preliminary experiments for this paper used feature sets that included collocates, co-occurrences, part-of-speech and grammatical information for surrounding words. However, it was clear that no combination of features resulted in disambiguation accuracy significantly higher than that achieved with co-occurrence features.</Paragraph>
    </Section>
    <Section position="3" start_page="67" end_page="67" type="sub_section">
      <SectionTitle>
5.3 Selecting ensemble members
</SectionTitle>
      <Paragraph position="0"> The most accurate classifier from each of nine possible category ranges is selected as a member of the ensemble. This is based on preliminary experiments that showed that member classifiers with similar sized windows of context often result in little or no overall improvement in disambiguation accuracy.</Paragraph>
      <Paragraph position="1"> This was expected since slight differences in window sizes lead to roughly equivalent representations of context and classifiers that have little opportunity for collective improvement. For example, an ensemble was created for interest using the nine classifiers in the range category (medium, medium). The accuracy of this ensemble was 84%, slightly less than the most accurate individual classifiers in that range which achieved accuracy of 86%.</Paragraph>
      <Paragraph position="2"> Early experiments also revealed that an ensemble based on a majority vote of all 81 classifiers performed rather poorly. The accuracy for interest was approximately 81% and line was disambiguated with slightly less than 80% accuracy. The lesson taken from these results was that an ensemble should consist of classifiers that represent as differently sized windows of context as possible; this reduces the impact of redundant errors made by classifiers that represent very similarly sized windows of context.</Paragraph>
      <Paragraph position="3"> The ultimate success of an ensemble depends on the ability to select classifiers that make complementary errors. This is discussed in the context of combining part-of-speech taggers in (Brill and Wu, 1998). They provide a measure for assessing the complementarity of errors between two taggers that could be adapted for use with larger ensembles such as the one discussed here, which has nine members.</Paragraph>
    </Section>
    <Section position="4" start_page="67" end_page="67" type="sub_section">
      <SectionTitle>
5.4 Disambiguation by majority vote
</SectionTitle>
      <Paragraph position="0"> In this paper ensemble disambiguation is based on a simple majority vote of the nine member classifiers.</Paragraph>
      <Paragraph position="1"> An alternative strategy is to weight each vote by the estimated joint probability found by the Naive Bayesian classifier. However, a preliminary study found that the accuracy of a Naive Bayesian ensemble using a weighted vote was poor. For interest, it resulted in accuracy of 83% while for line it was 82%. The simple majority vote resulted in accuracy of 89% for interest and 88% for line.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="67" end_page="67" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> A number of issues have arisen in the course of this work that merit further investigation.</Paragraph>
    <Paragraph position="1"> The simplicity of the contextual representation can lead to large numbers of parameters in the Naive Bayesian model when using wide windows of context. Some combination of stop-lists and stemming could reduce the numbers of parameters and thus improve the overall quality of the parameter estimates made from the training data.</Paragraph>
    <Paragraph position="2"> In addition to simple co-occurrence features, the use of collocation features seems promising. These are distinct from co-occurrences in that they are words that occur in close proximity to the ambiguous word and do so to a degree that is judged statistically significant.</Paragraph>
    <Paragraph position="3"> One limitation of the majority vote in this paper is that there is no mechanism for dealing with outcomes where no sense gets a majority of the votes. This did not arise in this study but will certainly occur as Naive Bayesian ensembles are applied to larger sets of data.</Paragraph>
    <Paragraph position="4"> Finally, further experimentation with the size of the windows of context seems warranted. The current formulation is based on a combination of intuition and empirical study. An algorithm to determine optimal windows sizes is currently under development. null</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML