<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2007"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics N Semantic Classes are Harder than Two</Title> <Section position="8" start_page="51" end_page="54" type="evalu"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 6.1 Baseline Comparison to Snow et al.'s Previous Hypernym Classification on WordNet-TREC data </SectionTitle> <Paragraph position="0"> Snow et al. (2005) evaluated binary classification of noun-phrase pairs as hypernyms or non-hypernyms. When training and testing on WordNet-labeled pairs from TREC sentences, they report classifier Max F of 0.348, using dependency path features and logistic regression. To justify our choice of an SVM for classification, we replicated their work. Snow et al. provided us with their data. With our SVM we achieved a Max F of 0.453, 30% higher than they reported.</Paragraph> </Section> <Section position="2" start_page="51" end_page="52" type="sub_section"> <SectionTitle> 6.2 Extending Snow et al.'s WordNet-TREC </SectionTitle> <Paragraph position="0"> Binary Classification to N Classes Snow et al. select pairs that are &quot;Known Hypernyms&quot; (the first sense of the first word is a hy- null ponym of the first sense of the second and both have no more than one tagged sense in the Brown corpus) and &quot;Known Non-Hypernyms&quot; (no sense of the first word is a hyponym of any sense of the second). We wished to test whether making the classes less cleanly separable would affect the results, and also whether we could use these features for n-way classification.</Paragraph> <Paragraph position="1"> From the same TREC corpus we extracted known synonym, known hyponym, known coordinate, known meronym, and known holonym pairs. Each of these classes is defined analogously to the known hypernym class; we selected these six relationships because they are the six most common. A pair is labeled known no-relationship if no sense of the first word has any relationship to any sense of the second word. The class distribution was selected to match as closely as possible that observed in query logs. We labeled 50,000 pairs total.</Paragraph> <Paragraph position="2"> Results are shown in Table 4(a). Although AUC is fairly high for all classes, MaxF is low for all but two. MaxF has degraded quite a bit for hypernyms from Table 3. Removing all instances except hypernym and no relationship brings MaxF up to 0.45, suggesting that the additional classes make it harder to separate hypernyms.</Paragraph> <Paragraph position="3"> Metaclassifier accuracy is very good, but this is due to high recall of no relationship and coordinate pairs: more than 80% of instances with some relationship are predicted to be coordinates, and most of the rest are predicted no relationship. It seems that we are only distinguishing between no vs. some relationship.</Paragraph> <Paragraph position="4"> The size of the no relationship class may be biasing the results. We removed those instances, but performance of the n-class classifier did not improve (Table 4(b)). 
<Paragraph position="4">The size of the no relationship class may be biasing the results. We removed those instances, but performance of the n-class classifier did not improve (Table 4(b)). MaxF of the binary classifiers did improve, even though AUC is much worse.</Paragraph>
</Section>
<Section position="3" start_page="52" end_page="52" type="sub_section">
<SectionTitle>6.3 N-Class Classification of Query Pairs</SectionTitle>
<Paragraph position="0">We now use query pairs rather than TREC pairs.</Paragraph>
</Section>
<Section position="4" start_page="52" end_page="52" type="sub_section">
<SectionTitle>6.3.1 Classification Using Only Dependency Paths</SectionTitle>
<Paragraph position="0">We first limit features to dependency paths in order to compare with the prior results. Dependency paths cannot be obtained for all query phrase pairs, since the two phrases must appear together in the same sentence. We used only the pairs for which we could get path features, about 32% of the total.</Paragraph>
<Paragraph position="1">Table 5(a) shows results of binary classification and metaclassification on those instances using dependency path features only. Dependency paths do not perform well on their own: most instances are assigned to the &quot;coordinate&quot; class, which comprises a plurality of instances. A comparison of Tables 5(a) and 4(a) suggests that classifying query substitution pairs is harder than classifying TREC phrases.</Paragraph>
<Paragraph position="2">Table 5(b) shows results of binary classification and metaclassification on the same instances using all features. Using all features improves performance dramatically, on each individual binary classifier as well as on the metaclassifier.</Paragraph>
</Section>
<Section position="5" start_page="52" end_page="53" type="sub_section">
<SectionTitle>6.3.2 Classification on All Query Pairs Using All Features</SectionTitle>
<Paragraph position="0">We now expand to all of our hand-labeled pairs.</Paragraph>
<Paragraph position="1">Table 6(a) shows results of binary classification and metaclassification; Figure 1 shows precision-recall curves for 10 binary classifiers (excluding URLs). Our classifier does quite well on every class but hypernym and hyponym. These two make up a very small percentage of the data, so it is not surprising that their performance is poor.</Paragraph>
<Paragraph position="2">The metaclassifier achieved 71% accuracy. This is significantly better than the random and majority-class baselines, and close to our 78% inter-annotator agreement. Restricting the metaclassifier to pairs whose maximum class probability exceeds 0.5 (68% of instances) gives 85% accuracy.</Paragraph>
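The setup evaluated above trains one binary classifier per relationship class and stacks their outputs as features for a metaclassifier, abstaining when the top class probability is too low. A minimal sketch under assumptions the paper does not specify: scikit-learn, SVMs with Platt-scaled probabilities, logistic regression as the metaclassifier, and a hypothetical class inventory:

```python
# Stacked "binary classifiers + metaclassifier" pipeline, with
# optional abstention when the top class probability is too low.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Hypothetical class inventory; the paper uses a larger set.
CLASSES = ["synonym", "hypernym", "hyponym", "coordinate", "no_relationship"]

def train_stack(X, y):
    """Fit one probabilistic SVM per class (y is an array of class
    names; each class must occur in it), then a metaclassifier over
    the vector of per-class probabilities."""
    binaries = {c: SVC(probability=True).fit(X, (y == c).astype(int))
                for c in CLASSES}
    # NOTE: a fair stack would build meta features from held-out
    # (cross-validated) predictions; in-sample is used here only to
    # keep the sketch short.
    meta_X = np.column_stack(
        [binaries[c].predict_proba(X)[:, 1] for c in CLASSES])
    meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
    return binaries, meta

def predict(binaries, meta, X, threshold=0.0):
    """Predict one class per instance; return None (abstain) when the
    maximum class probability falls below the threshold."""
    meta_X = np.column_stack(
        [binaries[c].predict_proba(X)[:, 1] for c in CLASSES])
    probs = meta.predict_proba(meta_X)
    labels = meta.classes_[probs.argmax(axis=1)]
    return [l if p.max() >= threshold else None
            for l, p in zip(labels, probs)]
```

With threshold=0.5, predict mirrors the filtering step above: only the confident subset of pairs receives a label, and accuracy is measured on that subset.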
<!-- Table captions displaced by the page break: Table 4(b): "Removing no relationship instances improves MaxF and recall of all classes, but performance is generally worse." Table 5(b), partial: "...all our features significantly improves performance over just using dependency paths." -->
<Paragraph position="3">Next we wish to see how much of the performance can be maintained without the computationally expensive syntactic parsing that dependency paths require. To estimate the marginal gain of the other features over the dependency paths, we excluded the dependency path features and retrained our classifiers. Results are shown in Table 6(b). Even though binary and metaclassifier performance decreases on all classes but generalizations and specifications, much of the performance is maintained.</Paragraph>
<Paragraph position="4">Because URL changes are easily identified by the IsURL feature, we removed those instances and retrained the classifiers. Results are shown in Table 6(c). Although overall accuracy is worse, individual class performance is still high, allowing us to conclude that our results are not due only to the ease of classifying URLs.</Paragraph>
<Paragraph position="5">We generated a learning curve by randomly sampling instances, training the binary classifiers on each subset, and training the metaclassifier on the outputs of the binary classifiers. The curve is shown in Figure 2. With 10% of the instances, metaclassifier accuracy is 59%; with 100% of the data, it is 71%. Accuracy shows no sign of leveling off as instances are added.</Paragraph>
</Section>
<Section position="6" start_page="53" end_page="54" type="sub_section">
<SectionTitle>6.4 Training on WordNet-Labeled Pairs Only</SectionTitle>
<Paragraph position="0">Figure 2 implies that more labeled instances will lead to greater accuracy. However, manually labeled instances are generally expensive to obtain. Here we look to other sources of labeled instances for additional training pairs.</Paragraph>
<SectionTitle>6.4.1 Training and Testing on WordNet</SectionTitle>
<Paragraph position="1">We trained and tested five classifiers using 10-fold cross-validation on our set of WordNet-labeled query segment pairs. Results for each class are shown in Table 7. We seem to have regressed to predicting no relationship vs. some relationship.</Paragraph>
<Paragraph position="2">Because these results are not as good as the human-labeled results, we believe that some of our performance must be due to peculiarities of our data. That is not unexpected: words that appear in WordNet are very common, so their features are much noisier than the features associated with query entities, which are often structured within web pages.</Paragraph>
<Paragraph position="3">We took the five classes for which the human and WordNet definitions agree (synonyms, coordinates, hypernyms, hyponyms, and no relationship), trained classifiers on all WordNet-labeled instances, and tested them on human-labeled instances from just those five classes. Results are shown in Table 8. Performance was not very good, reinforcing the idea that while our features can distinguish between query segments, they cannot distinguish between common words.</Paragraph>
</Section>
</Section>
</Paper>
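As a closing illustration of the learning-curve procedure from Section 6.3.2 (train on random subsets of increasing size, retrain the full stack, record metaclassifier accuracy), a sketch reusing the hypothetical train_stack and predict helpers above; the fractions and the use of a fixed held-out test set are assumptions, not the paper's exact protocol:

```python
# Learning curve: metaclassifier accuracy as a function of the
# fraction of training instances used.
import numpy as np
from sklearn.metrics import accuracy_score

def learning_curve(X_train, y_train, X_test, y_test,
                   fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    curve = []
    for frac in fractions:
        n = max(1, int(frac * len(y_train)))
        idx = rng.choice(len(y_train), size=n, replace=False)
        binaries, meta = train_stack(X_train[idx], y_train[idx])
        preds = predict(binaries, meta, X_test)  # threshold=0.0: no abstention
        curve.append((frac, accuracy_score(y_test, preds)))
    return curve
```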