File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1022_metho.xml
Size: 23,343 bytes
Last Modified: 2025-10-06 14:08:25
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1022"> <Title>Supersense Tagging of Unknown Nouns in WordNet</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Lexicographer Classes for Noun Classification </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 WordNet Lexicographer Labels </SectionTitle> <Paragraph position="0"> WordNet (Fellbaum, 1998) is a broad-coverage machine-readable dictionary. Release 1.71 of the English version lists about 150,000 entries for all open-class words, mostly nouns (109,000 types), but also verbs, adjectives, and adverbs. WordNet is organized as a network of lexicalized concepts, sets of synonyms called synsets; e.g., the nouns {chairman, chairwoman, chair, chairperson} form a synset. A word that belongs to several synsets is ambiguous.</Paragraph> <Paragraph position="1"> To facilitate the development of WordNet, lexicographers organize synsets into several domains, based on syntactic category and semantic coherence.</Paragraph> <Paragraph position="2"> Each noun synset is assigned one out of 26 broad categories, such as &quot;phenomenon&quot;, &quot;entity&quot;, &quot;object&quot;, etc. Since these broad categories group together very many synsets, i.e., word senses, we call them supersenses. The supersense labels that WordNet lexicographers use to organize nouns are listed in Table 1. Notice that since the lexicographer labels are assigned to synsets, ambiguity is often preserved even at this level. For example, chair has three supersenses: &quot;person&quot;, &quot;artifact&quot;, and &quot;act&quot;. This set of labels has a number of attractive features for the purposes of lexical acquisition. It is fairly general and therefore small. The reasonable size of the label set makes it possible to apply state-of-the-art machine learning methods. By contrast, classifying new words at the synset level defines a multiclass problem with a huge class space - more than 66,000 noun synsets in WordNet 1.6, and more than 75,000 in the newest release, 1.71 (cf. also (Ciaramita, 2002) on this problem). At the same time the labels are not too abstract or vague. Most of the classes seem natural and easily recognizable, which is probably why the lexicographers chose them to facilitate their task. But there are more important practical and methodological advantages.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Extra Training Data from WordNet </SectionTitle> <Paragraph position="0"> WordNet contains a great deal of information about words and word senses. The information contained in the dictionary's glosses is very similar to what is typically listed in normal dictionaries: synonyms, definitions, and example sentences. This suggests a very simple way to put it to use: it can be compiled into training data for supersense labels. This data can then be added to the data extracted from the training corpus.</Paragraph>
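As an illustration of how such data can be compiled, the sketch below uses NLTK's WordNet interface (a newer WordNet release than 1.6/1.71, and not the tooling used by the authors) to pair every noun gloss and example sentence with the supersense of its synset, which NLTK exposes through lexname():

from nltk.corpus import wordnet as wn

def gloss_training_data():
    """Yield (text, supersense) pairs built from WordNet's noun glosses."""
    for synset in wn.all_synsets(pos=wn.NOUN):
        supersense = synset.lexname()          # e.g. "noun.person"
        yield synset.definition(), supersense  # definition text
        for example in synset.examples():      # example sentences, if any
            yield example, supersense

Each pair would then be turned into a feature vector with the same kind of feature extractor used for the corpus data (Section 4.2).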
<Paragraph position="1"> For several thousand concepts WordNet's glosses are very informative. The synset &quot;chair&quot; for example looks as follows:</Paragraph> <Paragraph position="3"> chair, chairperson - (the officer who presides at the meetings of an organization); &quot;address your remarks to the chairperson&quot;.</Paragraph> <Paragraph position="4"> In WordNet 1.6, 66,841 synsets contain definitions (in parentheses above), and 6,147 synsets contain example sentences (in quotation marks). As we show below, this information about word senses is useful for supersense tagging. Presumably this is because if it can be said of a &quot;chairperson&quot; that she can &quot;preside at meetings&quot; or that &quot;a remark&quot; can be &quot;addressed to her&quot;, then logically speaking these things can be said of the superordinates of &quot;chairperson&quot;, like &quot;person&quot;, as well.</Paragraph> <Paragraph position="5"> Therefore information at the synset level is also relevant at the supersense level. Furthermore, while individually each gloss doesn't say much about the narrow concept it is attached to (at least from a machine learning perspective), at the supersense level this information accumulates. In fact it forms a small corpus of supersense-annotated data that can be used to train a classifier for supersense tagging of words or for other semantic classification tasks.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Evaluation Methods </SectionTitle> <Paragraph position="0"> Formulating the problem in this fashion also makes it possible to define a very natural evaluation procedure. Systems can be trained on nouns listed in a given release of WordNet and tested on the nouns introduced in a later version. The set of lexicographer labels remains constant and can be used across different versions.</Paragraph> <Paragraph position="1"> In this way systems can be tested on a more realistic lexical acquisition task - the same task that lexicographers carried out to extend the database. The task is then well defined, well motivated, and easily standardizable.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Relation to Named-Entity Tasks </SectionTitle> <Paragraph position="0"> The categories typically used in named-entity recognition tasks are a subset of the noun supersense labels: &quot;person&quot;, &quot;location&quot;, and &quot;group&quot;. Small label sets like these can be sufficient in named-entity recognition. Collins and Singer (1999), for example, report that 88% of the named entities occurring in their data set belong to these three categories.</Paragraph> <Paragraph position="1"> The distribution of common nouns, however, is more uniform. We estimated this distribution by counting the occurrences of 744 unambiguous common nouns newly introduced in WordNet 1.71. Figure 1 plots the cumulative frequency distribution of supersense tokens; the labels are ordered by decreasing relative frequency as in Table 1.</Paragraph> <Paragraph position="2"> The most frequent supersenses are &quot;person&quot;, &quot;communication&quot;, &quot;artifact&quot;, etc. The three most frequent supersenses account for a little more than 50% of all tokens, and 9 supersenses account for 90% of all tokens. A larger number of labels is needed for supersense tagging than for named-entity recognition. The figure also shows the distribution of labels for all unambiguous tokens in WordNet 1.6; the two distributions are quite similar.</Paragraph>
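A distribution like the one in Figure 1 can be estimated along the following lines (a sketch, again relying on NLTK's WordNet rather than on releases 1.6 and 1.71; noun_counts is a hypothetical mapping from each new noun to its corpus frequency):

from collections import Counter
from nltk.corpus import wordnet as wn

def supersense_distribution(noun_counts):
    """Relative frequency of each supersense over unambiguous noun tokens."""
    dist = Counter()
    for noun, freq in noun_counts.items():
        synsets = wn.synsets(noun, pos=wn.NOUN)
        if len(synsets) == 1:                 # keep unambiguous nouns only
            dist[synsets[0].lexname()] += freq
    total = sum(dist.values())
    return {label: count / total for label, count in dist.items()}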
</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The &quot;new&quot; nouns in WordNet 1.71 and the &quot;old&quot; ones in WordNet 1.6 constitute the test and training data that we used in our word classification experiments. Here we describe the experimental setup: training and test data, and the features used.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Training data </SectionTitle> <Paragraph position="0"> We extracted from the Bllip corpus all occurrences of nouns that have an entry in WordNet 1.6.</Paragraph> <Paragraph position="1"> Bllip (BLLIP, 2000) is a 40-million-word syntactically parsed corpus. We used the parses to extract the syntactic features described below. We then removed all ambiguous nouns, i.e., nouns that are tagged with more than one supersense label (72% of the tokens, 28.9% of the types). In this way we avoided dealing with the problem of ambiguity. (A simple option for dealing with ambiguous words would be to distribute an ambiguous noun's counts over all of its senses; however, in preliminary experiments we found that better accuracy is achieved using only unambiguous nouns. We will investigate this issue in future research.)</Paragraph> <Paragraph position="2"> We extracted a feature vector for each noun instance using the feature set described below. Each vector is a training instance. In addition we compiled another training set from the example sentences and from the definitions in the noun database of WordNet 1.6. Overall this procedure produced 787,186 training instances from Bllip, 66,841 training instances from WordNet's definitions, and 6,147 training instances from the example sentences.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Features </SectionTitle> <Paragraph position="0"> We used a mix of standard features used in word sense disambiguation, named-entity classification and lexical acquisition. The following sentence illustrates them, with art-students as the tagged noun: &quot;The art-students, nine teen-agers, read the book&quot;. A sketch of such a feature extractor is given after the list.
1. part of speech of the neighboring words (e.g., DT, NNS, ...);
2. single words in the surrounding context (e.g., W=read, W=book, W=the, ...);
3. bigrams and trigrams (e.g., the nine, nine teen-agers, ...);
4. syntactically governed elements under a given phrase;
5. syntactically governing elements under a given phrase;
6. coordinates/appositives (e.g., teen-agers);
7. spelling/morphological features: prefixes, suffixes, complex morphology (e.g., the prefixes a, ar, ..., the suffixes s, ts, ..., and the components art and student).</Paragraph>
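The shallow features (1-3 and 7) can be extracted along the lines of the sketch below; this is not the authors' code, the feature-name prefixes (P, W, B, PRE, SUF) are made up for illustration, and the syntactic features (4-6) would additionally require the parser output:

def noun_features(tokens, pos_tags, index):
    """Shallow context features for the noun at tokens[index]."""
    word = tokens[index]
    feats = set()
    # 1. part of speech of the neighboring words
    for offset in (-2, -1, 1, 2):
        if 0 <= index + offset < len(tokens):
            feats.add(f"P{offset:+d}={pos_tags[index + offset]}")
    # 2. single words in the surrounding context
    for context_word in tokens[max(0, index - 3):index] + tokens[index + 1:index + 4]:
        feats.add("W=" + context_word.lower())
    # 3. bigram skipping the target word
    if 0 < index < len(tokens) - 1:
        feats.add("B=" + tokens[index - 1].lower() + "_" + tokens[index + 1].lower())
    # 7. spelling features: prefixes and suffixes of the target noun
    for k in (1, 2, 3):
        feats.add("PRE=" + word[:k].lower())
        feats.add("SUF=" + word[-k:].lower())
    return feats

Called on the example sentence above (tokenized so that commas are separate tokens) with index pointing at art-students, this yields features such as W=nine, PRE=ar and SUF=ts.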
<Paragraph position="1"> Open class words were morphologically simplified with the &quot;morph&quot; function included in WordNet. We parsed the WordNet definitions and example sentences with the same syntactic parser used for Bllip (Charniak, 2000).</Paragraph> <Paragraph position="2"> It is not always possible to identify the noun that represents the synset in the WordNet glosses. For example, in the gloss for the synset relegation the example sentence is &quot;He has been relegated to a post in Siberia&quot;, where a verb is used instead of the noun.</Paragraph> <Paragraph position="3"> When it was possible to identify the target noun the complete feature set was used; otherwise only the surrounding-word features (2) and the spelling features (7) of all synonyms were used. With the definitions it is much harder to identify the target; consider the definition &quot;a member of the genus Canis&quot; for dog. For all definitions we used only the reduced feature set. One training instance per synset was extracted from the example sentences and one from the definitions. Overall, in the experiments we performed we used around 1.5 million features.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Evaluation </SectionTitle> <Paragraph position="0"> In a similar way to how we produced the training data, we compiled a test set from the Bllip corpus.</Paragraph> <Paragraph position="1"> We found all instances of nouns that are not in WordNet 1.6 but are listed in WordNet 1.71 with only one supersense. The majority of the novel nouns in WordNet 1.71 are unambiguous (more than 90%).</Paragraph> <Paragraph position="2"> There were 744 new noun types, with a total frequency of 9,537 occurrences. We refer to this test set as Test$_{1.71}$.</Paragraph> <Paragraph position="3"> We also randomly removed 755 noun types (20,394 tokens) from the training data and used them as an alternative test set. We refer to this other test set as Test$_{1.6}$. We then ran experiments using the averaged multiclass perceptron.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 The Multiclass Averaged Perceptron </SectionTitle> <Paragraph position="0"> We used a multiclass averaged perceptron classifier, an &quot;ultraconservative&quot; on-line learning algorithm (Crammer and Singer, 2002) that extends the standard perceptron to the multiclass case. It takes as input a training set $S = \{(x_i, y_i)\}_{i=1}^n$, where each $x_i \in \mathbb{R}^d$ represents an instance of a noun and $y_i \in \mathcal{Y}$. Here $\mathcal{Y}$ is the set of supersenses defined by WordNet. Since for training and testing we used only unambiguous words, there is always exactly one label per instance.</Paragraph> <Paragraph position="3"> Thus $S$ summarizes $n$ word tokens that belong to the dictionary, where each instance $i$ is represented as a vector of features $x_i$ extracted from the context in which the noun occurred; $d$ is the total number of features; and $y_i$ is the true label of $x_i$.</Paragraph> <Paragraph position="4"> In general, a multiclass classifier for the dictionary is a function $H : \mathbb{R}^d \rightarrow \mathcal{Y}$ that maps feature vectors $x$ to one of the possible supersenses of WordNet.
In the multiclass perceptron, one introduces a weight vector $v_y \in \mathbb{R}^d$ for every $y \in \mathcal{Y}$ and defines $H$ implicitly by the so-called winner-take-all rule $H(x) = \arg\max_{y \in \mathcal{Y}} \langle v_y, x \rangle$. Here $V \in \mathbb{R}^{d \times |\mathcal{Y}|}$ refers to the matrix of weights, with every column corresponding to one of the weight vectors $v_y$.</Paragraph> <Paragraph position="7"> The learning algorithm works as follows: Training patterns are presented one at a time in the standard on-line learning setting. Whenever $H(x_i) \neq y_i$ an update step is performed; otherwise the weight vectors remain unchanged. To perform the update, one first computes the error set $E_i$ containing those class labels that have received a higher score than the correct class: $E_i = \{y \neq y_i : \langle v_y, x_i \rangle \geq \langle v_{y_i}, x_i \rangle\}$.</Paragraph> <Paragraph position="11"> An ultraconservative update scheme in its most general form is then defined as follows: update $v_{y_i} \leftarrow v_{y_i} + x_i$ and, for each $y \in E_i$, $v_y \leftarrow v_y - \tau_y x_i$, where the update weights satisfy $\tau_y \geq 0$ and $\sum_{y \in E_i} \tau_y = 1$ so that the update is balanced, which is crucial to guaranteeing the convergence of the learning procedure (cf. (Crammer and Singer, 2002)). We have focused on the simplest case of uniform update weights, $\tau_y = 1/|E_i|$. The procedure is summarized in Algorithm 1.</Paragraph> <Paragraph position="16"> Notice that the multiclass perceptron algorithm learns all weight vectors in a coupled manner, in contrast to methods that perform multiclass classification by combining binary classifiers, for example, training a classifier for each class in a one-against-the-rest manner.</Paragraph> <Paragraph position="17"> The averaged version of the perceptron (Collins, 2002), like the voted perceptron (Freund and Schapire, 1999), reduces the effect of over-training. In addition to the matrix of weight vectors $V$, the model keeps track, for each feature weight, of each value it assumed during training and of the number of consecutive training-instance presentations during which this weight was not changed, its &quot;life span&quot;. For example, if a feature weight is not updated until example 500, at which point it is incremented to value 1, and is not touched again until after example 1000, then the average weight of that feature in the averaged perceptron at example 750 will be $\frac{0 \cdot 500 + 1 \cdot 250}{500 + 250}$, or 1/3. At example 1000 it will be 1/2, etc. We used the averaged model for evaluation and parameter setting; see below. Figure 2 plots the results on test data of both models. The averaged model produces a better-performing and smoother output.</Paragraph>
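A minimal sketch of the training procedure just described, with uniform ultraconservative updates and with the averaged weights computed as a running sum (equivalent to the life-span bookkeeping above); it assumes dense NumPy arrays and integer class ids, unlike the sparse, large-scale implementation the experiments would require:

import numpy as np

def train_averaged_multiclass_perceptron(X, y, num_classes, epochs=10):
    """X: (n_instances, n_features) array; y: integer class ids."""
    n, d = X.shape
    V = np.zeros((num_classes, d))      # one weight vector per supersense
    V_sum = np.zeros_like(V)            # running sum for the averaged model
    presentations = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            scores = V @ x_i
            # error set: labels scoring at least as well as the true label
            E = [c for c in range(num_classes)
                 if c != y_i and scores[c] >= scores[y_i]]
            if E:
                tau = 1.0 / len(E)      # uniform, balanced update weights
                V[y_i] += x_i           # promote the correct supersense
                for c in E:
                    V[c] -= tau * x_i   # demote the competing supersenses
            V_sum += V                  # accumulate after every presentation
            presentations += 1
    return V_sum / presentations        # averaged weight matrix

def predict(V_avg, x):
    """Winner-take-all rule H(x) = argmax_y <v_y, x>."""
    return int(np.argmax(V_avg @ x))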
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Parameter Setting </SectionTitle> <Paragraph position="0"> We used an implementation with a full, i.e., not sparse, representation of the weight matrix for the perceptron. Training and testing are fast, at the expense of a slightly greater memory load. Given the great number of features, we couldn't use the full training set from the Bllip corpus. Instead we randomly sampled roughly half of the available training data, yielding around 400,000 instances; with the WordNet data added, the training set is close to 500,000 instances. When training to test on Test$_{1.6}$, we removed from the WordNet training set the synsets corresponding to the nouns in Test$_{1.6}$.</Paragraph> <Paragraph position="1"> The only adjustable parameter to set is the number of passes over the training data, or epochs. While testing on Test$_{1.71}$ we set this parameter using Test$_{1.6}$, and vice versa for Test$_{1.6}$. The estimated values for the stopping iterations were very close, at roughly ten passes. As Figure 2 shows, the great amount of data requires many passes over the data, around 1,000, before reaching convergence (on Test$_{1.71}$).</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> The classifier outputs the estimated supersense label of each instance of each unknown noun type. The label $L(w)$ of a noun type $w$ is obtained by voting: $L(w) = \arg\max_{y \in \mathcal{Y}} \sum_{x \in w} [H(x) = y]$, where $[\cdot]$ is the indicator function and $x \in w$ means that $x$ is a token of type $w$. (During preliminary experiments we also tried creating one single aggregate pattern for each test noun type, but this method produced worse results.) The type score on $w$ is 1 if the voted label is the correct supersense, and 0 otherwise.</Paragraph>
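The type-level vote can be written, for instance, as follows (a sketch; token_predictions would hold the supersense predicted for each corpus token of the unknown noun):

from collections import Counter

def vote_type_label(token_predictions):
    """Majority vote over the labels assigned to a noun type's tokens."""
    return Counter(token_predictions).most_common(1)[0][0]

# e.g. vote_type_label(["person", "person", "artifact"]) == "person"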
<Paragraph position="4"> Table 2 summarizes the results of the experiments on Test$_{1.71}$ (upper half) and on Test$_{1.6}$ (bottom half). A baseline was computed that always selected the most frequent label in the training data, &quot;person&quot;, which is also the most frequent label in both Test$_{1.6}$ and Test$_{1.71}$. The baseline performances are in the low twenties. The first and second columns report performance on tokens and types respectively.</Paragraph> <Paragraph position="5"> The classifiers' results are averages over 50 trials in which a fraction of the Bllip data was randomly selected. One classifier was trained on 55% of the Bllip data (AP-B-55). An identical one was trained on the same data and, additionally, on the WordNet data (AP-B-55+WN). We also trained a classifier on 65% of the Bllip data (AP-B-65). Adding the WordNet data to this training set was not possible because of memory limitations. The model also trained on WordNet outperforms, on both test sets, those trained only on the Bllip data. A paired t-test showed the difference between models with and without WordNet data to be statistically significant. The &quot;least&quot; significant difference is between AP-B-65 and AP-B-55+WN (token) on Test$_{1.6}$, with a p-value below 0.01; in all other cases the p-level is much smaller.</Paragraph> <Paragraph position="6"> These results seem to show that the positive impact of the WordNet data is not simply due to the fact that there is more training data. (Notice that 10% of the Bllip data is approximately the size of the WordNet data, and therefore AP-B-65 and AP-B-55+WN are trained on roughly the same amount of data.) Adding the WordNet data seems more effective than adding an equivalent amount of standard training data. Figure 3 plots the results of the last set of (single-trial) experiments we performed, in which we varied the amount of Bllip data added to the WordNet data. The model with WordNet data often performs better than the model trained only on Bllip data even when the latter training set is much larger.</Paragraph> <Paragraph position="7"> Two important reasons why the WordNet data is particularly good are, in our opinion, the following.</Paragraph> <Paragraph position="8"> The data is less noisy because it is extracted from sentences and definitions that are always &quot;pertinent&quot; to the class label. The data also contains instances of disambiguated polysemous nouns, which were instead excluded from the Bllip training. This means that disambiguating the training data is important; unfortunately this is not a trivial task. Using the WordNet data provides a simple way of getting at least some information from ambiguous nouns.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Differences Between Test Sets </SectionTitle> <Paragraph position="0"> The type scores on both evaluations produced similar results. This finding supports the hypothesis that the two evaluations are similar in difficulty, and that the two versions of WordNet are not inconsistent in the way they assign supersenses to nouns.</Paragraph> <Paragraph position="1"> The evaluations show, however, very different patterns at the token level. This might be due to the fact that the label distribution of the training data is more similar to Test$_{1.6}$ than to Test$_{1.71}$. In particular, there are many new nouns in Test$_{1.71}$ that belong to &quot;abstract&quot; classes, such as &quot;communication&quot; (e.g., reaffirmation) or &quot;cognition&quot; (e.g., mind set), which seem harder to learn. Abstract classes are also more confusable; i.e., members of these classes are frequently mis-classified with the same wrong label. A few very frequently mis-classified pairs are communication/act, communication/person and communication/artifact.</Paragraph> <Paragraph position="2"> Because abstract nouns are more frequent in Test$_{1.71}$ than in Test$_{1.6}$, the accuracy on tokens is much worse in the new evaluation than in the more standard one. This has an impact also on the type scores. Figure 4 plots the results on types for Test$_{1.6}$ and Test$_{1.71}$ grouped in bins of test noun types ranked by decreasing frequency. It shows that the first bin is harder in Test$_{1.71}$ than in Test$_{1.6}$. Overall, then, it seems that there are similarities but also important differences between the evaluations. Therefore the new evaluation might define a more realistic task than cross-validation.</Paragraph> </Section> </Paper>