Unknown word sense detection as outlier detection

3 Experimental setup and data

Experimental setup. To evaluate an unknown sense detection system, we need occurrences that are guaranteed not to belong to any of the seen senses. To that end we use sense-annotated data, in our case the FrameNet annotated sentences, simulating unknown senses by designating one sense of each ambiguous lemma as unknown. All occurrences of that sense are placed in the test set, while occurrences of all other senses are split randomly between training and test set, using 5-fold cross-validation. We repeat the experiment with each of the senses of an ambiguous lemma playing the part of the unknown sense once. Viewing each cross-validation run for each unknown sense as a separate experiment, we then report precision and recall averaged over unknown senses and cross-validation runs.

It may seem questionable that in this experimental setup, the unknown sense occurrences of each lemma all belong to the same sense. However, this does not bias the experiment, since none of the models we study takes advantage of the shape of the test set in any way. Rather, each test item is classified individually, without recourse to the other test items.

Data. All experiments in this paper were performed on the FrameNet 1.2 annotated data pertaining to ambiguous lemmas. After removal of instances that were annotated with more than one sense, we obtain 26,496 annotated sentences for the 1,031 ambiguous lemmas. They were parsed with Minipar (Lin, 1993); named entities were computed using Heart of Gold (Callmeier et al., 2004).
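To make the splitting scheme concrete, here is a minimal sketch in Python. The data structures (a list of (features, sense) pairs per lemma) and the function name are our illustration; the paper does not describe its implementation at this level of detail.

```python
import random
from collections import defaultdict

def unknown_sense_splits(instances, n_folds=5, seed=0):
    """Yield (unknown_sense, train, test) splits for one ambiguous lemma.

    instances: list of (features, sense) pairs for the lemma.
    Each sense is designated unknown in turn: all of its occurrences go
    into the test set, while occurrences of the remaining senses are
    split randomly into n_folds cross-validation folds.
    """
    rng = random.Random(seed)
    by_sense = defaultdict(list)
    for inst in instances:
        by_sense[inst[1]].append(inst)

    for unknown in by_sense:
        rest = [inst for sense, insts in by_sense.items()
                if sense != unknown for inst in insts]
        rng.shuffle(rest)
        folds = [rest[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            train = [inst for g in range(n_folds) if g != f
                     for inst in folds[g]]
            test = folds[f] + by_sense[unknown]
            yield unknown, train, test
```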
4 Experiment 1: WSD confidence scores for unknown sense detection

In this section we test a very simple model of unknown sense detection: classifiers often return a confidence score along with the assigned label. We will try to detect unknown senses by a threshold on confidence scores, declaring anything below the threshold as unknown. Note that this method can only be applied to lemmas that have more than one sense, since for single-sense lemmas the system will always return the maximum confidence score.

Data. While the approach that we follow in this section is applicable to all lemmas with at least two senses, we need lemmas with at least three senses to evaluate it: one of the senses of each lemma is treated as unknown, which for lemmas with three or more senses leaves at least two senses for the training set. This reduces our data set to 125 lemmas with 7,435 annotated sentences.

Modeling. We test whether the WSD system built into SHALMANESER (Erk, 2005) can distinguish known sense items from unknown sense items reliably by its confidence scores. The system extracts a rich feature set, which forms the basis of all three experiments in this paper:

* a bag-of-words context, with a window size of one sentence;
* bi- and trigrams centered on the target word;
* grammatical function information: for each dependent of the target, (1) its function label, (2) its headword, and (3) a combination of both are used as features; (4) the concatenation of all function labels constitutes another feature. For PPs, function labels are extended by the preposition. As an example, Figure 2 shows a BNC sentence and its grammatical function features;
* for verb targets, the target voice.

The feature set is based on Florian et al. (2002) but contains additional syntax-related features. Each word-related feature is represented as four features for word, lemma, part of speech, and named entity. SHALMANESER trains one Naive Bayes classifier per lemma to be disambiguated. For this experiment, all system parameters were set to their default settings. To detect unknown senses building on this WSD system, we use a fixed confidence threshold and label all items below the threshold as unknown.
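The thresholding step itself is straightforward. The following sketch illustrates it; the interface (a dict of per-sense confidences for one test item) and the example sense names are our assumptions, not SHALMANESER's actual API.

```python
def label_with_threshold(confidences, th):
    """confidences: dict mapping each known sense of the lemma to the
    classifier's confidence for one test item. Returns 'unknown' when
    even the best-scoring sense falls below the threshold th."""
    best_sense = max(confidences, key=confidences.get)
    if confidences[best_sense] < th:
        return "unknown"
    return best_sense

# With two known senses, a normalized classifier's top confidence is
# at least 0.5, so the threshold must lie above that.
print(label_with_threshold({"PLACING": 0.97, "ENCODING": 0.03}, th=0.98))
# -> unknown
```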
Results and discussion. Table 1 shows precision and recall for labeling instances as unknown using different confidence thresholds th, averaged over unknown senses and 5-fold cross-validation. (Note that the minimum possible confidence score is 0.5 if 2 senses are present in the training set, 0.33 for 3 present senses, etc.)

[Table 1: Precision and recall for labeling instances as unknown, by unknown sense; WSD confidence score approach. th: confidence threshold; s: standard deviation.]

We see that while the precision of this method is acceptable at 0.74 to 0.765, recall is extremely low, i.e. almost no items were labeled unknown, even at a threshold of 0.98. However, SHALMANESER has very high confidence values overall: only 14.5% of all instances in this study had a confidence value of 0.98 or below (7,697 of 53,206).

We conclude that with the given WSD system and (rather standard) features, this simple method cannot detect items with an unknown sense reliably. This may be due to the indiscriminately high confidence scores; or it could indicate that classifiers, which are geared at distinguishing between known classes rather than at detecting objects that differ from all seen data, are not optimally suited to the task. A further disadvantage of this approach is that, as mentioned above, it can only be applied to lemmas with more than one annotated sense. For FrameNet 1.2, this comprises only 19% of the lemmas.

5 A nearest neighbor-based method for outlier detection

In the previous section we tested a simple approach to unknown sense detection using WSD confidence scores. Our conclusion was that it is not a viable approach, given its low recall and given that it is only applicable to lemmas with more than one known sense. In this section we introduce an alternative approach, which uses distances to nearest neighbors to detect outliers.

In general, the task of outlier detection is to decide whether a new object belongs to a given training set or not. Typically, outlier detection approaches derive some boundary around the training set, or they derive from the set some model of "normality" to which new objects are compared (Markou and Singh, 2003a; Markou and Singh, 2003b; Marsland, 2003). Applications of outlier detection include fault detection (Hickinbotham and Austin, 2000), handwriting deciphering (Tax and Duin, 1998; Schölkopf et al., 2000), and network intrusion detection (Yeung and Chow, 2002; Dasgupta and Forrest, 1999). One standard approach to outlier detection estimates the probability density of the training set, such that a test object can be classified as an outlier or non-outlier according to its probability of belonging to the set.

Rather than estimating the complete density function, Tax and Duin (2000) approximate the local density at the test object by comparing distances between nearest neighbors. Given a test object x, the approach considers the training object t nearest to x and compares the distance d_xt between x and t to the distance d_tt' between t and its own nearest training data neighbor t'. The quotient between the two distances is then used as an indicator of the (ab-)normality of the test object x:

    p_NN(x) = d_xt / d_tt'

When the distance d_xt is much larger than d_tt', x is considered an outlier. Figure 3 illustrates the idea.

[Figure 3: Comparing distances between nearest neighbors.]

The normality or abnormality of test objects is decided by a fixed threshold th on p_NN. The lowest threshold that makes sense is 1.0, which rejects any x that is further apart from its nearest training neighbor t than t is from its own neighbor. Tax and Duin use Euclidean distance, i.e.

    d(x, t) = sqrt( sum_i (x_i - t_i)^2 )

Applied to feature vectors with entries either 0 or 1, the squared distance corresponds to the size of the symmetric difference of the two feature sets; since the square root is monotone, nearest-neighbor rankings are the same either way.
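In the experiments below, nearest neighbors are computed with the ANN tool; purely as an illustration of the quotient itself, here is a brute-force NumPy sketch (our code, not the original implementation):

```python
import numpy as np

def p_nn(x, train):
    """Tax and Duin's nearest-neighbor distance quotient.

    x:     1-d feature vector of the test object
    train: 2-d array, one training vector per row (at least 2 rows,
           assumed free of duplicates so the denominator is nonzero)
    Returns d_xt / d_tt', where t is x's nearest training object and
    t' is t's own nearest neighbor within the training set.
    """
    dists = np.linalg.norm(train - x, axis=1)
    t = int(np.argmin(dists))
    d_xt = dists[t]
    d_tt = np.linalg.norm(train - train[t], axis=1)
    d_tt[t] = np.inf                    # exclude t itself
    return d_xt / float(d_tt.min())

def is_unknown(x, train, th=1.0):
    """Reject x as an outlier (unknown sense) above the threshold."""
    return p_nn(x, train) > th

train = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1]], dtype=float)
print(is_unknown(np.array([1.0, 0.0, 0.0]), train))  # -> False
```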
6 Experiment 2: NN-based outlier detection

In this section we use the NN-based outlier detection approach of the previous section for an experiment in unknown sense detection. Experimental setup and data are as described in Section 3.

Modeling. We model unknown sense detection as an outlier detection task, using Tax and Duin's outlier detection approach as outlined in the previous section. Nearest neighbors (by Euclidean distance) were computed using the ANN tool (Mount and Arya, 2005). We compute one outlier detection model per lemma. With training and test sets constructed as described in Section 3, the average training set comprises 22.5 sentences.

We use the same features as in Section 4, with feature vector entries of 1 for present and 0 for absent features. For a more detailed analysis of the contribution of different feature types, we test on reduced as well as full feature vectors:

* All: full feature vectors
* Cx: only bag-of-words context features (words, lemmas, POS, NE)
* Syn: function labels of dependents
* Syn-hw: Syn plus headwords of dependents

We compare the NN-based model to that of Experiment 1, but not to any simpler baseline. While for WSD it is possible to formulate simple frequency-based methods that can serve as a baseline, this is not so in unknown sense detection, because the frequency of unknown senses is, by definition, unknown. Furthermore, the number of annotated sentences per sense in FrameNet depends on the number of subcategorization frames of the lemma rather than on the frequency of the sense, which makes frequency calculations meaningless.

Results. Table 2 shows precision and recall for labeling instances as unknown using a distance quotient threshold of th = 1.0, averaged over unknown senses and over 5-fold cross-validation. We see that recall is markedly higher than in Experiment 1, especially for the two conditions that include context words, All and Cx. The syntax-based conditions Syn and Syn-hw show a higher precision, with a less pronounced increase in recall.

Raising the distance quotient threshold results in little change in precision, but a large drop in recall. For example, All vectors with a threshold of th = 1.1 achieve a recall of 0.14, in comparison to 0.27 for th = 1.0.

Training set size is an important factor in system results. Table 3 lists precision and recall for all training sets, for training sets of size >= 10, and for training sets of size >= 20. Especially in conditions All and Cx, recall rises steeply when we only consider cases with larger training sets. Note, however, that precision does not rise with larger training sets; rather, it shows a slight decline.

Another important factor is the number of senses that a lemma has, as the upper part of Table 7 shows. For lemmas with a higher number of senses, precision is much lower, while recall is much higher.

Discussion. While results in this experiment are better than in Experiment 1 (in particular, recall has risen by 19 points for Cx), system performance is still not high enough to be usable in practice.

The uniformity of the training set has a large influence on performance, as Table 7 shows. The more senses a lemma has, the harder it seems to be for the model to identify known sense occurrences: precision for the assignment of the unknown label drops, while recall rises. We see a tradeoff between precision and recall, in this table as well as in Table 3. There, we see that many more unknown test objects are identified when training sets are larger, but a larger training set does not translate into universally higher results.

One possible explanation for this lies in a property of Tax and Duin's approach. If a training item t is situated at distance d from its nearest neighbor in the training set, then any test item within a radius of d around t will be considered known. Thus we could term d the "acceptance radius" of t. Now if t is an outlier within the training set, then d will be large, as illustrated in Figure 4. The sparser the training set, the more training outliers we are likely to find, with large acceptance radii that assign a label of known even to distant test items. Thus a sparse training set could lead to lower recall of unknown sense assignment and at the same time higher precision, as the items labeled unknown would be the ones at great distance from any item in the training set, conforming to the pattern in Tables 3 and 7.
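The acceptance radii are computable from the training set alone, so the sparseness of a training set can be inspected directly. The following is a hypothetical diagnostic sketch, not part of the experiments reported here.

```python
import numpy as np

def acceptance_radii(train):
    """Distance from each training vector to its nearest other training
    vector. With th = 1.0, any test object within this radius of a
    training item t is labeled known; outliers within the training set
    have large radii and therefore accept even distant test items."""
    d = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)   # ignore self-distances
    return d.min(axis=1)
```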
7 Experiment 3: NN-based outlier detection with added training data

While the NN-based outlier detection model we used in the previous experiment showed better results than the WSD confidence model, its recall is still low. We have suggested that data sparseness may be responsible for the low performance. Consequently, we repeat the experiment of the previous section with more, but less specific, training data. Like WordNet synsets, FrameNet frames are semantic classes that typically comprise several lemmas or expressions. So, assuming that words with similar meaning occur in similar contexts, the context features for lemmas in the same frame should be similar. Following this idea, we supplement the training data for a lemma with all the other annotated data for the senses that are present in the training set, where by "other data" we mean data with other target lemmas. Table 4 shows an example. (Conditions Syn and Syn-hw were also tested using only other target lemmas with the same part of speech; results were virtually unchanged.)

[Table 4: Example of training set extension. Target lemma: put. Senses: ENCODING, PLACING. Sense currently treated as unknown: PLACING. Extend the training set by all annotated sentences for lemmas other than put in the sense ENCODING: couch.v, expression.n, formulate.v, formulation.n, frame.v, phrase.v, word.v, wording.n.]

Modeling. Again, we use Tax and Duin's outlier detection approach for unknown sense detection. The experimental design and evaluation are the same as in Experiment 2, the only difference being the training set extension. Training set extension raises the average training set size from 22.5 to 374.
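The extension step can be sketched as follows. The data structures (a per-frame index of annotated instances) are illustrative assumptions; the paper does not describe its preprocessing at this level.

```python
def extend_training_set(train, target_lemma, frame_index):
    """Add annotated sentences of other lemmas that share a frame with
    the senses present in the training set.

    train:       list of (features, frame) pairs for target_lemma
    frame_index: dict mapping each frame to a list of
                 (features, frame, lemma) triples over all lemmas
    """
    extended = list(train)
    for frame in {f for _, f in train}:       # frames seen in training
        for feats, fr, lemma in frame_index.get(frame, []):
            if lemma != target_lemma:         # only *other* target lemmas
                extended.append((feats, fr))
    return extended
```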
Results. Table 5 shows precision and recall for labeling instances as unknown, with a distance quotient threshold of 1.0, averaged over unknown senses and 5-fold cross-validation. In comparison to Experiment 2, precision has risen slightly, and for conditions All, Cx and Syn-hw, recall has risen steeply; the maximum recall is achieved by Cx at 0.82.

[Table 5: Precision and recall for labeling instances as unknown, by unknown sense; NN-based outlier detection, th = 1.0. s: standard deviation.]

As before, increasing the distance quotient threshold leads to little change in precision but a sharp drop in recall. For All vectors, recall is 0.72 for threshold 1.0, 0.56 for th = 1.1, and 0.41 for th = 1.2. Table 6 shows system performance by training set size. As the average training set in this experiment is much larger than in Experiment 2, we now inspect sets of minimum size 50 and 200 rather than 10 and 20. We find the same effect as in Experiment 2, with noticeably higher recall for lemmas with larger training sets, but slightly lower precision. Table 7 breaks down system performance by the degree of ambiguity of a lemma. Here, too, we see the same effect as in Experiment 2: the more senses a lemma has, the lower the precision and the higher the recall of unknown label assignment.

[Table 7: System performance by number of senses of a lemma, condition All, th = 1.0.]

Discussion. In comparison to Experiment 2, Experiment 3 shows a dramatic increase in recall, and even some increase in precision. Precision and recall for conditions All and Cx are good enough for the system to be usable in practice.

Of the four conditions, the three that involve context words (All, Cx and Syn-hw) show considerably higher recall than Syn. Furthermore, the two conditions that do not involve syntactic features, All and Cx, have markedly higher results than Syn-hw. This could mean that syntactic features are not as helpful as context features in detecting unknown senses; however, in Experiment 2 the performance difference between Syn and the other conditions was not nearly as large as in this experiment. It could also mean that frames are not as uniform in their syntactic structure as they are in their context words. This seems plausible, as FrameNet frames are constructed mostly on semantic grounds, without recourse to similarity in syntactic structure.

Table 6 points to a sparse data problem, even with training sets extended by additional items. It also shows that the more a test condition relies on context word information, the more it profits from additional data. So it may be worthwhile to explore methods for alleviating data sparseness further, e.g. by generalizing over context words.

Table 7 underscores the large influence of training set uniformity: the more senses a lemma has, the more likely the model is to classify a test instance as unknown. This is the case even for extended training sets. One possible way of addressing this problem would be to take into account more than a single nearest neighbor in NN-based outlier detection, in order to compute more precise boundaries between known and unknown instances.
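As an illustration of what such an extension might look like (our sketch, not a variant evaluated in this paper), the distance quotient of Section 5 could be averaged over the k nearest training objects:

```python
import numpy as np

def p_knn(x, train, k=3):
    """Average the Tax-and-Duin quotient over the k training objects
    nearest to x, instead of relying on the single nearest neighbor.
    Assumes no duplicate training vectors (nonzero denominators)."""
    dists = np.linalg.norm(train - x, axis=1)
    quotients = []
    for t in np.argsort(dists)[:k]:
        d_tt = np.linalg.norm(train - train[t], axis=1)
        d_tt[t] = np.inf              # exclude t itself
        quotients.append(dists[t] / d_tt.min())
    return float(np.mean(quotients))
```

The intuition is that averaging over several neighbors should dampen the effect of any single training outlier with a large acceptance radius.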