<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-3004">
  <Title>Discriminating Among Word Senses Using McQuitty's Similarity Analysis</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Discrimination Features
</SectionTitle>
    <Paragraph position="0"> We carry out discrimination based on surface lexical features that require little or no preprocessing to identify.</Paragraph>
    <Paragraph position="1"> They consist of unigrams, bigrams, and second order cooccurrences. null Unigrams are single words that occur in the same context as a target word. Bag-of-words feature sets made up of unigrams have had a long history of success in text classification and word sense disambiguation (Mooney, 1996), and we believe that despite creating quite a bit of noise can provide useful information for discrimination.</Paragraph>
    <Paragraph position="2"> Bigrams are pairs of words which occur together in the same context as the target word. They may include the target word, or they may not. We specify a window of size five for bigrams, meaning that there may be up to three intervening words between the first and last word that make up the bigram. As such we are defining bigrams to be non-consecutive word sequences, which could also be considered a kind of co-occurrence feature.</Paragraph>
    <Paragraph position="3"> Bigrams have recently been shown to be very successful features in supervised word sense disambiguation (Pedersen, 2001). We believe this is because they capture middle distance co-occurrence relations between words that occur in the context of the target word.</Paragraph>
    <Paragraph position="4"> Second order co-occurrences are words that occur with co-occurrences of the target word. For example, suppose that line is the target word. Given telephone line and telephone bill, bill would be considered a second order co-occurrence of line since it occurs with telephone, a first order co-occurrence of line.</Paragraph>
    <Paragraph position="5"> We define a window size of five in identifying second order co-occurrences, meaning that the first order co-occurrence must be within five positions of the target word, and the second order co-occurrence must be within five positions of the first order co-occurrence. We only select those second order co-occurrences which co-occur more than once with the first order co-occurrences which in turn co-occur more than once with the target word within the specified window.</Paragraph>
    <Paragraph position="6"> We employ a stop list to remove high frequency non-content words from all of these features. Unigrams that are included in the stop list are not used as features. A bi-gram is rejected if any word composing it is a stop word. Second order co-occurrences that are stop words or those that co-occur with stop words are excluded from the feature set.</Paragraph>
    <Paragraph position="7"> After the features have been identified in the training data, all of the instances in the test data are converted into binary feature vectors a0a2a1a4a3a6a5a7a1a9a8a10a5a12a11a6a11a12a11a12a5a7a1a14a13a16a15 that represent whether the features found in the training data have occurred in a particular test instance. In order to cluster these instances, we measure the pair-wise similarities between them using matching and cosine coefficients.</Paragraph>
    <Paragraph position="8"> These values are formatted in a a0a2a1a3a0 similarity matrix such that cell a0a5a4 a5a7a6 a15 contains the similarity measure between instances a4 and a6 . This information serves as the input to the clustering algorithm that groups together the most similar instances.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Methodology
</SectionTitle>
    <Paragraph position="0"> We evaluate our method using two well known sources of sense-tagged text. In supervised learning sense-tagged text is used to induce a classifier that is then applied to held out test data. However, our approach is purely unsupervised and we only use the sense tags to carry out an automatic evaluation of the discovered clusters. We follow Sch&amp;quot;utze's strategy and use a &amp;quot;training&amp;quot; corpus only to extract features and ignore the sense tags.</Paragraph>
    <Paragraph position="1"> In particular, we use subsets of the line data (Leacock et al., 1993) and the English lexical sample data from the SENSEVAL-2 comparative exercise among word sense disambiguation systems (Edmonds and Cotton, 2001).</Paragraph>
    <Paragraph position="2"> The line data contains 4,146 instances, where each consists of two to three sentences where a single occurrence of line has been manually tagged with one of six possible senses. We randomly select 100 instances of each sense for test data, and 200 instances of each sense for training. This gives a total of 600 evaluation instances, and 1200 training instances. This is done to test the quality of our discrimination method when senses are uniformly distributed and where no particular sense is dominant.</Paragraph>
    <Paragraph position="3"> The standard distribution of the SENSEVAL-2 data consists of 8,611 training instances and 4,328 test instances. Each instance is made up of two to three sentences where a single target word has been manually tagged with a sense (or senses) appropriate for that context. There are 73 distinct target words found in this data; 29 nouns, 29 verbs, and 15 adjectives. Most of these words have less than 100 test instances, and approximately twice that number of training examples. In general these are relatively small samples for an unsupervised approach, but we are developing techniques to increase the amount of training data for this corpus automatically. null We filter the SENSEVAL-2 data in three different ways to prepare it for processing and evaluation. First, we insure that it only includes instances whose actual sense is among the top five most frequent senses as observed in the training data for that word. We believe that this is an aggressive number of senses for a discrimination system to attempt, considering that (Pedersen and Bruce, 1997) experimented with 2 and 3 senses, and (Sch&amp;quot;utze, 1998) made binary distinctions.</Paragraph>
    <Paragraph position="4"> Second, instances may have been assigned more than one correct sense by the human annotator. In order to simplify the evaluation process, we eliminate all but the most frequent of multiple correct answers.</Paragraph>
    <Paragraph position="5"> Third, the SENSEVAL-2 data identifies target words that are proper nouns. We have elected not to use that information and have removed these P tags from the data.</Paragraph>
    <Paragraph position="6"> After carrying out these preprocessing steps, the number of training and test instances is 7,476 and 3,733.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation Technique
</SectionTitle>
    <Paragraph position="0"> We specify an upper limit on the number of senses that McQuitty's algorithm can discover. In these experiments this value is five for the SENSEVAL-2 data, and six for line. In future experiments we will specify even higher values, so that the algorithm is forced to create larger number of clusters with very few instances when the actual number of senses is smaller than the given cutoff.</Paragraph>
    <Paragraph position="1"> About a third of the words in the SENSEVAL-2 data have fewer than 5 senses, so even now the clustering algorithm is not always told the correct number of clusters it should find.</Paragraph>
    <Paragraph position="2"> Once the clusters are formed, we access the actual correct sense of each instance as found in the sense-tagged text. This information is never utilized prior to evaluation. We use the sense-tagged text as a gold standard by which we can evaluate the discovered sense clusters. We assign sense tags to clusters such that the resulting accuracy is maximized.</Paragraph>
    <Paragraph position="3"> For example, suppose that five clusters (C1 - C5) have been discovered for a word with 100 instances, and that the number of instances in each cluster is 25, 20, 10, 25, and 20. Suppose that there are five actual senses (S1 -S5), and the number of instances for each sense is 20, 20, 20, 20, and 20. Figure 1 shows the resulting confusion matrix if the senses are assigned to clusters in numeric order. After this assignment is made, the accuracy of the clustering can be determined by finding the sum of the diagonal, and dividing by the total number of instances, which in this case leads to accuracy of 10% (10/100).</Paragraph>
    <Paragraph position="4"> However, clearly there are assignments of senses to clusters that would lead to better results.</Paragraph>
    <Paragraph position="5"> Thus, the problem of assigning senses to clusters becomes one of reordering the columns of the confusion such that the diagonal sum is maximized. This corresponds to several well known problems, among them the  mining the maximal matching of a bipartite graph. Figure 2 shows the maximally accurate assignment of senses to clusters, which leads to accuracy of 70% (70/100).</Paragraph>
    <Paragraph position="6"> During evaluation we assign one cluster to at most one sense, and vice versa. When the number of discovered clusters is the same as the number of senses, then there is a 1 to 1 mapping between them. When the number of clusters is greater than the number of actual senses, then some clusters will be left unassigned. And when the  number of senses is greater than the number of clusters, some senses will not be assigned to any cluster.</Paragraph>
    <Paragraph position="7"> We determine the precision and recall based on this maximally accurate assignment of sense tags to clusters.</Paragraph>
    <Paragraph position="8"> Precision is defined as the number of instances that are clustered correctly divided by the number of instances clustered, while recall is the number of instances clustered correctly over the total number of instances.</Paragraph>
    <Paragraph position="9"> To be clear, we do not believe that word sense discrimination must be carried out relative to a pre-existing set of senses. In fact, one of the great advantages of an unsupervised approach is that it need not be relative to any particular set of senses. We carry out this evaluation technique in order to improve the performance of our clustering algorithm, which we will then apply on text where sense-tagged data is not available.</Paragraph>
    <Paragraph position="10"> An alternative means of evaluation is to have a human inspect the discovered clusters and judge them based on the semantic coherence of the instances that populate each cluster, but this is a more time consuming and subjective method of evaluation that we will pursue in future.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> For each word in the SENSEVAL-2 data and line, we conducted various experiments, each of which uses a different combination of measure of similarity and features.</Paragraph>
    <Paragraph position="1"> Features are identified from the training data. Our features consist of unigrams, bigrams, or second order cooccurrences. We employ each of these three types of features separately, and we also create a mixed set that is the union of all three sets. We convert each evaluation instance into a feature vector, and then convert those into a similarity matrix using either the matching coefficient or the cosine.</Paragraph>
    <Paragraph position="2"> Table 1 contains overall precision and recall for the nouns, verbs, and adjectives overall in the SENSEVAL-2 data, and for line. The SENSEVAL-2 values are derived from 29 nouns, 28 verbs, and 15 adjectives from the SENSEVAL-2 data. The first column lists the part of speech, the second shows the feature, the third lists the measure of similarity, the fourth and the fifth show precision and recall, the sixth shows the percentage of the majority sense, and the final column shows the number of words in the given part of speech that gave accuracy greater than the percentage of the majority sense. The value of the majority sense is derived from the sense-tagged data we use in evaluation, but this is not information that we would presume to have available during actual clustering.</Paragraph>
    <Paragraph position="3">  For the SENSEVAL-2 data, on average the precision and recall of the clustering as determined by our evaluation method is less than that of the majority sense, regardless of which features or measure are used. However, for nouns and verbs, a relatively significant number of individual words have precision and recall values higher than that of the majority sense. The adjectives are an exception to this, where words are very rarely disambiguated more accurately than the percentage of the majority sense. However, many of the adjectives have very high frequency majority senses, which makes this a difficult standard for an unsupervised method to reach.</Paragraph>
    <Paragraph position="4"> When examining the distribution of instances in clusters, we find that the algorithm tends to seek more balanced distributions, and is unlikely to create a single long cluster that would result in high accuracy for a word whose true distribution of senses is heavily skewed towards a single sense.</Paragraph>
    <Paragraph position="5"> We also note that the precision and recall of the clustering of the line data is generally better than that of the majority sense regardless of the features or measures employed. We believe there are two explanations for this.</Paragraph>
    <Paragraph position="6"> First, the number of training instances for the line data is significantly higher (1200) than that of the SENSEVAL-2 words, which typically have 100-200 training instances per word. The number and quality of features identified improves considerably with an increase in the amount of training data. Thus, the amount of training data available for feature identification is critically important. We believe that the SENSEVAL-2 data could be augmented with training data taken from the World Wide Web, and we plan to pursue such approaches and see if our performance on the evaluation data improves as a result.</Paragraph>
    <Paragraph position="7"> At this point we do not observe a clear advantage to using the cosine measure or matching coefficient. This surprises us somewhat, as the number of features employed is generally in the thousands, and the number of non-zero features can be quite large. It would seem that simply counting the number of matching features would be inferior to the cosine measure, but this is not the case. This remains an interesting issue that we will continue to explore, with these and other measures of similarity.</Paragraph>
    <Paragraph position="8"> Finally, there is not a single feature that does best in all parts of speech. Second order co-occurrences seem to do well with nouns and adjectives, while bigrams result in accurate clusters for verbs. We also note that second order co-occurrences do well with the line data. As yet we have drawn no conclusions from these results, but it is clearly a vital issue to investigate further.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Related Work
</SectionTitle>
    <Paragraph position="0"> Unsupervised approaches to word sense discrimination have been somewhat less common in the computational linguistics literature, at least when compared to supervised approaches to word sense disambiguation.</Paragraph>
    <Paragraph position="1"> There is a body of work at the intersection of supervised and unsupervised approaches, which involves using a small amount of training data in order to automatically create more training data, in effect bootstrapping from the small sample of sense-tagged data. The best example of such an approach is (Yarowsky, 1995), who proposes a method that automatically identifies collocations that are indicative of the sense of a word, and uses those to iteratively label more examples.</Paragraph>
    <Paragraph position="2"> While our focus has been on Pedersen and Bruce, and on Sch&amp;quot;utze, there has been other work in purely unsupervised approaches to word sense discrimination.</Paragraph>
    <Paragraph position="3"> (Fukumoto and Suzuki, 1999) describe a method for discriminating among verb senses based on determining which nouns co-occur with the target verb. Collocations are extracted which are indicative of the sense of a verb based on a similarity measure they derive.</Paragraph>
    <Paragraph position="4"> (Pantel and Lin, 2002) introduce a method known as Committee Based Clustering that discovers word senses.</Paragraph>
    <Paragraph position="5"> The words in the corpus are clustered based on their distributional similarity under the assumption that semantically similar words will have similar distributional characteristics. In particular, they use Pointwise Mutual Information to find how close a word is to its context and then determine how similar the contexts are using the cosine coefficient.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 Future Work
</SectionTitle>
    <Paragraph position="0"> Our long term goal is to develop a method that will assign sense labels to clusters using information found in machine readable dictionaries. This is an important problem because clusters as found in discrimination have no sense tag or label attached to them. While there are certainly applications for unlabeled sense clusters, having some indication of the sense of the cluster would bring discrimination and disambiguation closer together. We will treat glosses as found in a dictionary as vectors that we project into the same space that is populated by instances as we have already described. A cluster could be assigned the sense of the gloss whose vector it was most closely located to.</Paragraph>
    <Paragraph position="1"> This idea is based loosely on work by (Niwa and Nitta, 1994), who compare word co-occurrence vectors derived from large corpora of text with co-occurrence vectors based on the definitions or glosses of words in a machine readable dictionary. A co-occurrence vector indicates how often words are used with each other in a large corpora or in dictionary definitions. These vectors can be projected into a high dimensional space and used to measure the distance between concepts or words. Niwa and Nitta show that while the co-occurrence data from a dictionary has different characteristics that a co-occurrence vector derived from a corpus, both provide useful information about how to categorize a word based on its meaning. Our future work will mostly attempt to merge clusters found from corpora with meanings in dictionaries where presentation techniques like co-occurrence vectors could be useful.</Paragraph>
    <Paragraph position="2"> There are a number of smaller issues that we are investigating. We are also exploring a number of other types of features, as well as varying the formulation of the features we are currently using. We have already conducted a number of experiments that vary the window sizes employed with bigrams and second order co-occurrences, and will continue in this vein. We are also considering the use of other measures of similarity beyond the matching coefficient and the cosine. We do not stem the training data prior to feature identification, nor do or employ fuzzy matching techniques when converting evaluation instances into feature vectors. However, we believe both might lead to increased numbers of useful features being identified.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML