<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-3009">
  <Title>Example-Based Metonymy Recognition for Proper Nouns</Title>
  <Section position="3" start_page="71" end_page="74" type="metho">
    <SectionTitle>
2 Example-based metonymy recognition
</SectionTitle>
    <Paragraph position="0"> As I have argued, Nissim and Markert's (2003) approach to metonymy recognition is quite complex. I therefore wanted to see if this complexity can be dispensed with, and if it can be replaced with the much more simple algorithm of Memory-Based Learning. The advantages of Memory-Based Learning (MBL), which is implemented in the TiMBL classifier (Daelemans et al., 2004)1, are twofold. First, it is based on a plausible psychological hypothesis of human learning. It holds that people interpret new examples of a phenomenon by comparing them to &amp;quot;stored representations of earlier experiences&amp;quot; (Daelemans et al., 2004, p.19). This contrasts to many other classification algorithms, such as Naive Bayes, whose psychological validity is an object of heavy debate. Second, as a result of this learning hypothesis, an MBL classifier such as TiMBL eschews the formulation of complex rules or the computation of probabilities during its training phase. Instead itstores alltraining vectors toits memory, together with their labels. In the test phase, it computes the distance between thetest vector and allthese train- null ing vectors, and simply returns the most frequent label of the most similar training examples.</Paragraph>
    <Paragraph position="1"> One of the most important challenges in Memory-Based Learningisadapting thealgorithm to one's data. This includes finding a representative seed set as well as determining the right distance measures. For my purposes, however, TiMBL's default settings proved more than satisfactory. TiMBL implements the IB1 and IB2 algorithms that were presented inAhaet al. (1991), but adds a broad choice of distance measures. Its default implementation of the IB1 algorithm, which is called IB1-IG in full (Daelemans and Van den Bosch, 1992), proved most successful in my experiments. It computes the distance between two vectors X and Y by adding up the weighted distances d between their corresponding feature values xi and yi:</Paragraph>
    <Paragraph position="3"> The most important element in this equation is the weight that is given to each feature. In IB1-IG, features are weighted by their Gain Ratio (equation 4), the division of the feature's Information Gain by its split info. Information Gain, the numerator in equation (4), &amp;quot;measures how much information it [feature i] contributes to our knowledge of the correct class label [...] by computing the difference in uncertainty (i.e. entropy) between the situations without and with knowledge of the value of that feature&amp;quot; (Daelemans et al., 2004, p.20). In order not &amp;quot;to overestimate the relevance of features with large numbers of values&amp;quot; (Daelemans et al., 2004, p.21), this Information Gain is then divided by the split info, the entropy of the feature values (equation 5). In the following equations, C is the set of class labels, H(C) is the entropy of that set, and Vi is the set of values for feature i.</Paragraph>
    <Paragraph position="5"> The IB2 algorithm wasdeveloped alongside IB1 in order to reduce storage requirements (Aha et al., 1991). It iteratively saves only those instances that are misclassified by IB1. This isbecause these will likely lie close to the decision boundary, and hence, be most informative to the classifier. My experiments showed, however, that IB2's best performance lay more than 2% below that of IB1. It will therefore not be treated any further here.</Paragraph>
    <Section position="1" start_page="72" end_page="73" type="sub_section">
      <SectionTitle>
2.1 Experiments with grammatical information only
</SectionTitle>
      <Paragraph position="0"> information only In order to see if Memory-Based Learning is able to replicate Nissim and Markert's (2003; 2005) results, I used their corpora for a number of experiments. These corpora consist of one set with about 1000 mixed country names, another with 1000 occurrences of Hungary, and a final set with about 1000 mixed organization names.2 Evaluation was performed with ten-fold cross-validation.</Paragraph>
      <Paragraph position="1"> The first round of experiments used only grammatical information. The experiments for the location data were similar to Nissim and Markert's (2003), and took the following features into  account: * the grammatical function of the word (subj, obj, iobj, pp, gen, premod, passive subj, other); * its head; * the presence of a second head; * the second head (if present).</Paragraph>
      <Paragraph position="2"> The experiments for the organization names used the same features as Nissim and Markert (2005): * the grammatical function of the word; * its head; * its type of determiner (if present) (def, indef, bare, demonst, other); * its grammatical number (sing, plural); * its number of grammatical roles (if present).  The number of words in the organization name, which Nissim and Markert used as a sixth and final feature, led to slightly worse results in my experiments and was therefore dropped.</Paragraph>
      <Paragraph position="3"> The results of these first experiments clearly beat the baselines of 79.7% (countries) and 63.4% (organizations). Moreover, despite its extremely  simple learning phase, TiMBL is able to replicate the results from Nissim and Markert (2003; 2005). As table 1 shows, accuracy for the mixed country data is almost identical to Nissim and Markert's figure, and precision, recall and F-score for the metonymical class lie only slightly lower.3 TiMBL's results for the Hungary data were similar, and equally comparable to Markert and Nissim's (Katja Markert, personal communication). Note, moreover, that these results were reached with grammatical information only, whereas Nissim and Markert's (2003) algorithm relied on semantics as well.</Paragraph>
      <Paragraph position="4"> Next, table 2 indicates that TiMBL's accuracy forthemixedorganization dataliesabout1.5%below Nissim and Markert's (2005) figure. This result should be treated with caution, however. First, Nissim and Markert's available organization data had not yet been annotated for grammatical features, and my annotation may slightly differ from theirs. Second, Nissim and Markert used several feature vectors for instances with more than one grammatical role and filtered all mixed instances from thetraining set. Atestinstance wastreated as mixedonly when its several feature vectors were classified differently. My experiments, in contrast, were similar to those for the location data, in that each instance corresponded to one vector. Hence, the slightly lower performance of TiMBL is probably due to differences between the two experiments. null These first experiments thus demonstrate that Memory-Based Learning can give state-of-the-art performance in metonymy recognition. In this respect, it is important to stress that the results for the country data were reached without any semantic information, whereas Nissim and Markert's (2003) algorithm used Dekang Lin's (1998) clusters of semantically similar words in order to deal with data sparseness. This fact, together 3Precision, recall and F-score are given for the metonymical class only, since this isthe category that metonymy recognition is concerned with.</Paragraph>
    </Section>
    <Section position="2" start_page="73" end_page="74" type="sub_section">
      <SectionTitle>
2.2 Experiments with semantic and grammatical information
</SectionTitle>
      <Paragraph position="0"> grammatical information It is still intuitively true, however, that the interpretation of a possibly metonymical word depends mainly on the semantics of its head. The question is if this information is still able to improve the classifier's performance. Itherefore performed a second round of experiments with the location data, in which I also made use of semantic information. In this round, I extracted the hypernym synsets of the head's first sense from WordNet.</Paragraph>
      <Paragraph position="1"> WordNet's hierarchy of synsets makes it possible to quantify the semantic relatedness of two words: the more hypernyms two words share, the more closely related they are. I therefore used the ten highest hypernyms of the first head as features 5 to 14. For those heads with fewer than ten hypernyms, a copy of their lowest hypernym filled the 'empty' features. As a result, TiMBL would first look for training instances with ten identical hypernyms, then with nine, etc. It would thus comparethetestexampletothesemantically mostsimilar training examples.</Paragraph>
      <Paragraph position="2"> However, TiMBL did not perform better with this semantic information. Although F-scores for the metonymical category went up slightly, the system's accuracy hardly changed. This result was not due to the automatic selection of the first(most frequent) WordNet sense. By manually disambiguating all the heads in the training and test set of the country data, I observed that this first sense was indeed often incorrect, but that choosing the correct sense did not lead to a more robust system.</Paragraph>
      <Paragraph position="3"> Clearly, the classifier did not benefit from Word-Net information as Nissim and Markert's (2003) did from Lin's (1998) thesaurus.</Paragraph>
      <Paragraph position="4"> The learning curves for the country set allow us to compare the two types of feature vectors  in more detail.4 As figure 1 indicates, with respect to overall accuracy, semantic features have a negative influence: the learning curve with both features climbs much more slowly than that with only grammatical features. Hence, contrary to my expectations, grammatical features seem to allow a better generalization from a limited number of training instances. With respect to the F-score on the metonymical category in figure 2, the differences are much less outspoken. Both features give similar learning curves, but semantic features lead to a higher final F-score. In particular, the use of semantic features results in a lower precision figure, but a higher recall score. Semantic features thus cause the classifier to slightly overgeneralize from the metonymic training examples.</Paragraph>
      <Paragraph position="5"> There are two possible reasons for this inability of semantic information to improve the classifier's performance. First, WordNet's synsets do not always map well to one of our semantic labels: many are rather broad and allow for several readings of the target word, while others are too specific to make generalization possible. Second, there is the predominance of prepositional phrases in our data. With their closed set of heads, the number of examples that benefits from semantic information about its head is actually rather small.</Paragraph>
      <Paragraph position="6"> Nevertheless, my first round of experiments has indicated that Memory-Based Learning is a simple but robust approach to metonymy recognition. It is able to replace current approaches that needsmoothing oriterative searches through athesaurus, with a simple, distance-based algorithm.</Paragraph>
      <Paragraph position="7"> 4These curves were obtained by averaging the results of 10 experiments. They show performance on a test set of 40% of the data, with the other 60% as training data.</Paragraph>
      <Paragraph position="8">  Moreover, in contrast to some other successful classifiers, it incorporates a plausible hypothesis of human learning.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="74" end_page="76" type="metho">
    <SectionTitle>
3 Distance-based sample selection
</SectionTitle>
    <Paragraph position="0"> Theprevious section has shownthat asimple algorithm that compares test examples to stored training instances is able to produce state-of-the-art results in the field of metonymy recognition. This leads to the question of how many examples we actually need to arrive at this performance. After all, the supervised approach that we explored requires the careful manual annotation of a large number of training instances. This knowledge acquisition bottleneck compromises the extrapolation of this approach to a large number of semantic classes and metonymical patterns. This section will therefore investigate if it is possible to automatically choose informative examples, sothatannotation effort can be reduced drastically.</Paragraph>
    <Paragraph position="1"> For this round of experiments, two small changes were made. First, since we are focusing on metonymy recognition, I replaced all specific metonymical labels with the label met, so that only three labels remain: lit, met and mixed.</Paragraph>
    <Paragraph position="2"> Second, whereas the results in theprevious section were obtained with ten-fold cross-validation, I ran these experiments with a training and a test set.</Paragraph>
    <Paragraph position="3"> On each run, I used a random 60% of the data for training; 40% was set aside for testing. All curves give the average of twenty test runs that use grammatical information only.</Paragraph>
    <Paragraph position="4"> In general, sample selection proceeds on the basis of the confidence that the classifier has in its classification. Commonly used metrics are the probability of the most likely label, or the entropy  try data with random and maximum-distance selection of training examples.</Paragraph>
    <Paragraph position="5"> over all possible labels. The algorithm then picks those instances with the lowest confidence, since these will contain valuable information about the training set (and hopefully also the test set) that is still unknown to the system.</Paragraph>
    <Paragraph position="6"> One problem with Memory-Based Learning algorithms is that they do not directly output probabilities. Since they are example-based, they can only give the distances between the unlabelled instance and all labelled training instances. Nevertheless, these distances can be used as a measure of certainty, too: we can assume that the system is most certain about the classification of test instances that lie very close to one or more of its training instances, and less certain about those that are further away. Therefore the selection function that minimizes the probability of the most likely label can intuitively be replaced by one that maximizes the distance from the labelled training instances. null However, figure 3 shows that for the mixed country instances, this function is not an option.</Paragraph>
    <Paragraph position="7"> Both learning curves give the results of an algorithm that starts with fifty random instances, and then iteratively adds ten new training instances to this initial seed set. Thealgorithm behind the solid curve chooses these instances randomly, whereas the one behind the dotted line selects those that are most distant from the labelled training examples. In the first half of the learning process, both functions are equally successful; in the second the distance-based function performs better, but only slightly so.</Paragraph>
    <Paragraph position="8"> There are two reasons for this bad initial performance of the active learning function. First, it is not able to distinguish between informative and  try data with random and maximum/minimumdistance selection of training examples. unusual training instances. This is because a large distance from the seed set simply means that the particular instance's feature values are relatively unknown. This does not necessarily imply that the instance is informative to the classifier, however. After all, it may be so unusual and so badly representative of the training (and test) set that the algorithm had better exclude it -- something that is impossible on the basis of distances only. This biastowards outliers isawell-known disadvantage of many simple active learning algorithms. A second type of bias is due to the fact that the data has beenannotated withafewfeatures only. Moreparticularly, the present algorithm will keep adding instances whose head is not yet represented in the training set. This entails that it will put off adding instances whose function is pp, simply because other functions (subj, gen, ...) have a wider variety in heads. Again, the result is a labelled set that is not very representative of the entire training set.</Paragraph>
    <Paragraph position="9"> There are, however, a few easy ways to increase the number of prototypical examples in the training set. In a second run of experiments, I used an active learning function that added not only those instances that were most distant from the labelled training set, but also those that were closest to it. After a few test runs, I decided to add six distant andfourcloseinstances oneachiteration. Figure4 showsthatsuch afunction isindeed fairly successful. Because it builds a labelled training set that is more representative of the test set, this algorithm clearly reduces the number of annotated instances that is needed to reach a given performance.</Paragraph>
    <Paragraph position="10"> Despite its success, this function is obviously notyet asophisticated wayof selecting good train- null zation data with random and distance-based (AL) selection of training examples with a random seed set.</Paragraph>
    <Paragraph position="11"> ing examples. The selection of the initial seed set in particular can be improved upon: ideally, this seed set should take into account the overall distribution of the training examples. Currently, the seeds are chosen randomly. This flaw in the algorithm becomes clear if it is applied to another data set: figure 5 shows that it does not outperform random selection on the organization data, for instance.</Paragraph>
    <Paragraph position="12"> As I suggested, the selection of prototypical or representative instances as seeds can be used to make the present algorithm more robust. Again, it is possible to use distance measures to do this: before the selection of seed instances, the algorithm can calculate for each unlabelled instance its distance from each of the other unlabelled instances. In this way, it can build a prototypical seed set by selecting those instances with the smallest distance on average. Figure 6 indicates that such an algorithm indeed outperforms random sample selection on the mixed organization data. For the calculation of the initial distances, each feature received the same weight. The algorithm then selected 50 random samples from the 'most prototypical' half of the training set.5 The other settings were the same as above.</Paragraph>
    <Paragraph position="13"> With thepresent small number of features, however, such a prototypical seed set is not yet always as advantageous as it could be. A few experiments indicated that it did not lead to better performance on the mixed country data, for instance. However, as soon as a wider variety of features is taken into account (as with the organization data), the advan- null zation data with random and distance-based (AL) selection of training examples with a prototypical seed set.</Paragraph>
    <Paragraph position="14"> tages of a prototypical seed set will definitely become more obvious.</Paragraph>
    <Paragraph position="15"> In conclusion, it has become clear that a careful selection of training instances may considerably reduce annotation effort in metonymy recognition.</Paragraph>
    <Paragraph position="16"> Functions that construct a prototypical seed set and then use MBL's distance measures to select informative as well as typical samples are extremely promising in this respect and can already considerably reduce annotation effort. In order to reach an accuracy of 85% on the country data, for instance, the active learning algorithm above needs 44%fewertraining instances thanitsrandom competitor (on average). On the organisation data, reduction is typically around 30%. These relatively simple algorithms thus constitute a good basis for the future development of robust active learning techniques for metonymy recognition. I believe in particular that research in this field should go hand in hand with an investigation of new informative features, since the present, limited feature set does not yet allow us to measure the classifier's confidence reliably.</Paragraph>
  </Section>
class="xml-element"></Paper>