
<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2504">
  <Title>What's in a name? The automatic recognition of metonymical location names.</Title>
  <Section position="4" start_page="25" end_page="29" type="metho">
    <SectionTitle>
2 An unsupervised approach to metonymy recognition
</SectionTitle>
    <Section position="1" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
2.1 Background
</SectionTitle>
      <Paragraph position="0"> Unsupervised machine learning algorithms do not need any labelled training examples. Instead, the algorithm itself has to group the training instances into a pre-defined number of clusters, which ideally correspond to the implicit target labels. The approach studied here is Schütze's (1998) Word Sense Discrimination, which uses second-order co-occurrence in order to identify clusters of senses.</Paragraph>
      <Paragraph position="1"> Schütze's (1998) algorithm first maps all words in the training corpus onto word vectors, which contain frequency information about the word's first-order co-occurrents. It then builds a vector representation for each of the contexts of the target by adding up the word vectors of the words in this context. These second-order context vectors get clustered (often after some form of dimensionality reduction), and each of the clusters is assumed to correspond to one of the senses of the target. The classification of a test word, finally, proceeds by assigning it to the cluster whose centroid lies nearest to its context vector. Schütze showed that, with about 8,000 training instances on average, this algorithm obtains very promising results.</Paragraph>
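As an illustration of the pipeline just described, here is a minimal sketch in Python (not from the paper; scikit-learn's KMeans stands in for Schütze's own clustering step, and the data layout is assumed):

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def first_order_vectors(corpus, window=5):
    """Map every word onto a vector of first-order co-occurrence counts."""
    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = defaultdict(lambda: np.zeros(len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][index[sent[j]]] += 1
    return vectors, len(vocab)

def context_vector(context_words, vectors, dim):
    """Second-order representation: the sum of the first-order vectors
    of the words in the context of the target."""
    known = [vectors[w] for w in context_words if w in vectors]
    return np.sum(known, axis=0) if known else np.zeros(dim)

def discriminate(train_contexts, test_context, vectors, dim, k=2):
    """Cluster the training contexts of the target into k clusters
    (ideally literal vs. metonymical) and label a test occurrence by
    the cluster whose centroid lies nearest to its context vector."""
    X = np.vstack([context_vector(c, vectors, dim) for c in train_contexts])
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    test_vec = context_vector(test_context, vectors, dim).reshape(1, -1)
    return km.predict(test_vec)[0]
```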
      <Paragraph position="2"> This unsupervised algorithm is not just attractive from a computational point of view; it is also related to human behaviour. First, it was inspired by Miller and Charles' (1991) observation that humans rely on contextual similarity in order to determine semantic similarity. Schütze (1998) therefore hypothesized that there must be a correlation between contextual similarity and word meaning as well: "a sense is a group of contextually similar occurrences of a word" (Schütze, 1998, p.99). Second, this algorithm lies at the basis of Latent Semantic Analysis (LSA). Although the psycholinguistic merits of LSA are an object of debate, its performance in several language tasks compares well to that of humans (Landauer and Dumais, 1997). Let us therefore investigate whether it is able to tackle metonymy recognition as well.</Paragraph>
      <Paragraph position="3"> Schütze's (1998) approach has been implemented in the SenseClusters program (Purandare and Pedersen, 2004), which also incorporates some interesting variations on and extensions to the original algorithm. First, Purandare and Pedersen (2004) defend the use of bigram features instead of simple word features. Bigrams are "ordered pairs of words that co-occur within five positions of each other" (Purandare and Pedersen, 2004, p.2) and will be used throughout this paper. Second, they found that the hybrid algorithm of Repeated Bisections performs better than Schütze's (1998) clustering algorithm -- at least for sparse data -- so I will use it here, too. Finally, as with all word sense discrimination techniques, evaluation proceeds indirectly: SenseClusters automatically finds the alignment of senses and clusters that leads to the fewest misclassifications, i.e. the confusion matrix that maximizes the diagonal sum.</Paragraph>
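This alignment step can be made concrete with a small sketch (a hypothetical helper, not SenseClusters' actual code): it tries every assignment of clusters to senses and keeps the one that maximizes the diagonal sum of the confusion matrix.

```python
from itertools import permutations
import numpy as np

def best_alignment(confusion):
    """Find the cluster-to-sense mapping that maximises the number of
    correctly classified instances. confusion[i][j] counts instances
    in cluster i whose gold sense is j."""
    confusion = np.asarray(confusion)
    n = confusion.shape[0]
    best_score, best_perm = -1, None
    for perm in permutations(range(n)):
        score = sum(confusion[i, perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_perm, best_score

# With two clusters and two senses this just compares the two diagonals:
mapping, correct = best_alignment([[40, 10], [15, 35]])
```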
    </Section>
    <Section position="2" start_page="26" end_page="28" type="sub_section">
      <SectionTitle>
2.2 Experiments
</SectionTitle>
      <Paragraph position="0"> On the basis of Markert and Nissim's location corpora, I tested whether unsupervised learning can be applied to metonymy recognition. 60% of the instances were used as training data, 40% as test data, and the number of pre-defined clusters was set to two. The experiments were designed with five specific research questions in mind:

* Does the algorithm perform better with one-word sets than with mixed sets of country names? Since the unsupervised WSD approach studied here uses lexical features only, I expected it to work better with the Hungary data than with the mixed country set. After all, one word can be expected to have fewer typical co-occurrences than an entire semantic class, so its contexts may be easier to cluster.</Paragraph>
      <Paragraph position="1"> * Should a stoplist be used? Unsupervised clustering on the basis of co-occurrences usually ignores a number of words that are thought to be uninformative about the reading of the target. Examples of such words are prepositions and extremely frequent verbs (be, give, go, ...). In metonymy recognition, however, these words may be much more useful than in classic WSD. If a location name occurs in a prepositional phrase with in, for instance, it is probably used literally. Similarly, verbs such as give and go determine the interpretation of a possibly metonymical word in contexts like give sth. to a country (metonymical) and go to a country (literal). Stoplists may therefore be less useful in metonymy recognition.</Paragraph>
      <Paragraph position="2"> * Are smaller context windows better than large ones? Markert and Nissim (2002a) discovered that, with co-occurrence features, the reduction of window sizes from 10 to about 3 led to a radical improvement in precision (from 25% to above 50%) and recall (from 4% to above 20%). Schütze's (1998) original algorithm, however, used context windows of 25 words on either side of the target.</Paragraph>
      <Paragraph position="3"> * Does Singular Value Decomposition result in better performance? Schütze (1998) found that his algorithm performs better with SVD than without. SVD is said to abstract away from word dimensions, and to discover topical dimensions instead.</Paragraph>
      <Paragraph position="4"> This helps tackle vocabulary issues such as synonymy and polysemy, and moreover addresses data sparseness (see the sketch after this list). However, as Markert and Nissim (2002a) argue, the sense distinctions between the literal and metonymical meanings of a word are not of a topical nature, so SVD may be less suited to metonymy recognition.

* Should features be selected statistically? Purandare and Pedersen (2004) use a log-likelihood test to select their features, probably because of the intuition that "candidate words whose occurrence depends on whether the ambiguous word occurs will be indicative of one of the senses of the ambiguous word and hence useful for disambiguation" (Schütze, 1998, p.102). Schütze, in contrast, found that statistical selection is outperformed by frequency-based selection when SVD is not used.</Paragraph>
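As announced in the SVD question above, here is a minimal sketch of that step (scikit-learn's TruncatedSVD as a stand-in for Schütze's SVD; all dimensions and data are illustrative): the matrix of second-order context vectors is reduced to a low-dimensional "topical" space before clustering.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# X: one second-order context vector per occurrence of the target word.
X = np.random.rand(200, 5000)            # illustrative dummy data

svd = TruncatedSVD(n_components=100)     # abstract away from raw word dimensions
X_reduced = svd.fit_transform(X)

clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X_reduced)
```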
      <Paragraph position="5"> Like Nissim and Markert (2003), I used four measures to evaluate the experimental results: precision, recall and F-score for the metonymical category, and overall accuracy. If the rows of the confusion matrix represent the correct labels and the columns the labels returned by the classifier, they are defined in the following way:

$$\text{precision} = \frac{\#\,\text{metonymies labelled metonymical}}{\#\,\text{instances labelled metonymical}}$$

$$\text{recall} = \frac{\#\,\text{metonymies labelled metonymical}}{\#\,\text{metonymies in the test data}}$$

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

$$\text{accuracy} = \frac{\#\,\text{correctly labelled instances}}{\#\,\text{all test instances}}$$</Paragraph>
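These definitions translate directly into code; the following sketch (not part of the paper) computes all four measures from such a confusion matrix:

```python
def scores(confusion):
    """confusion[i][j]: row i = correct label, column j = predicted label,
    with index 0 = literal and index 1 = metonymical."""
    tn, fp = confusion[0]          # literal instances
    fn, tp = confusion[1]          # metonymical instances
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy
```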
      <Paragraph position="7"> In engineering terms, a WSD system is only useful when its accuracy beats the so-called majority baseline. This is the accuracy of a system that simply gives the same, most frequent, label to all test instances. Such a classifier reaches an accuracy of 79.46% on the test corpus of mixed country names and of 77.35% on the test corpus with instances of Hungary.</Paragraph>
    </Section>
    <Section position="3" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
2.3 Experimental results
</SectionTitle>
      <Paragraph position="0"> Compared to this majority baseline, the results of the unsupervised approach fall short of the mark.</Paragraph>
      <Paragraph position="1"> None of the accuracy values in tables 1, 2 and 3 lies above this baseline. With baselines of almost 80%, however, this result comes as no surprise.</Paragraph>
      <Paragraph position="2"> Moreover, the classifier's failure to beat the majority baseline does not necessarily mean that it is unable to identify a 'metonymical' and a 'literal' cluster in the data. This ability should instead be investigated with a χ² test, which determines whether there is a correlation between a test instance's cluster on the one hand and its label on the other. If we compare the results against this χ² baseline, it emerges that in many cases, the identified clusters indeed significantly correlate with the reading of the target words. The default (+LL +SVD) algorithm, for instance, typically identifies a metonymical and a literal cluster in the mixed country data (table 1). It also becomes clear that the best algorithms are not those with the highest accuracy values. After all, an accuracy close to the baseline often results from the identification of one huge 'literal' cluster that covers most metonymies as well.</Paragraph>
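A χ² test of this kind is straightforward with standard tools; a minimal sketch (assuming SciPy, with invented counts rather than the paper's data):

```python
from scipy.stats import chi2_contingency

# Contingency table: rows = identified clusters,
# columns = gold readings (literal, metonymical).
table = [[90, 30],
         [20, 60]]            # illustrative counts only

chi2, p, dof, expected = chi2_contingency(table)
if p < 0.05:
    print("cluster membership correlates significantly with the reading")
```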
      <Paragraph position="3"> Let us now evaluate the algorithms with respect to the five research questions I mentioned above.</Paragraph>
      <Paragraph position="4"> First, a comparison between the results on the mixed country data in table 1 and the Hungary data in table 2 shows that the former are more consistent than the latter. The (+LL +SVD) algorithm in particular is very successful on the country data.</Paragraph>
      <Paragraph position="5"> There is thus no sign of the anticipated difficulty with sets of mixed target words.</Paragraph>
      <Paragraph position="6"> Second, when the algorithm is applied to the set of mixed country names, it should not use a stoplist. Not once did the resulting clusters correlate significantly with the target labels -- the results were therefore not included here. A possible reason is that the useful co-occurrences in this data tend to be words on the stoplist, but whether this is indeed the case should be studied more carefully.</Paragraph>
      <Paragraph position="7"> On the Hungary data, the use of a stoplist has a different effect. Overall success rate remains more or less the same (although F-scores with a stoplist are slightly lower on average), but the results display a different pattern. Broadly speaking, a stoplist is most beneficial when feature selection proceeds on the basis of frequency and when large contexts are used. Smaller contexts are more successful without a stoplist. There is a logic to this: as I observed above, stoplist words may be informative about the reading of a possibly metonymical word, but their usefulness increases when they are closer to the target. If go occurs within three words of a country name, it may point towards a literal reading; if it occurs within a context of twenty words, it is less likely to do so. This explains why stoplists work best in combination with bigger contexts.</Paragraph>
      <Paragraph position="8"> Overall, the influence of context size is hard to determine. Small windows of three words on either side of the target are generally most successful, but the context size that should be chosen depends on other characteristics of the algorithm. The same is true for dimensionality reduction and statistical feature selection. In general, the anticipated negative effects of dimensionality reduction were not observed, and frequency-based feature selection clearly benefited algorithms with a stoplist on the Hungary data. However, the algorithms should be applied to more data sets in order to investigate the precise effect of these factors.</Paragraph>
      <Paragraph position="9"> In short, although the investigated unsupervised algorithms never beat the majority baseline for Markert and Nissim's (2002b) data, they are often able to identify two clusters of data that correlate with the two possible readings. This is true for the set with one target word as well as for the set with mixed country names. In general, the algorithms that incorporate both statistical feature selection and Singular Value Decomposition lead to the best results, except for the Hungary data when no stoplist is used. In this last case, statistical feature selection is best dropped and a large context window should be chosen.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="29" end_page="30" type="metho">
    <SectionTitle>
3 Memory-based metonymy recognition
</SectionTitle>
    <Section position="1" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
3.1 Background
</SectionTitle>
      <Paragraph position="0"> Memory-Based Learning (MBL), which is implemented in the TiMBL classifier (Daelemans et al., 2004; the software is freely available from http://ilk.uvt.nl/software.html), rests on the hypothesis that people interpret new examples of a phenomenon by comparing them to "stored representations of earlier experiences" (Daelemans et al., 2004, p.19). It is thus related to Case-Based Reasoning, which holds that "[r]eference to previous similar situations is often necessary to deal with the complexities of novel situations" (Kolodner, 1993, p.5). As a result of this learning hypothesis, an MBL classifier such as TiMBL eschews the formulation of complex rules or the computation of probabilities during its training phase. Instead, it remembers all training vectors and gives a test vector the most frequent label of the most similar training vectors.</Paragraph>
      <Paragraph position="1"> TiMBL implements a number of MBL algorithms. In my experiments, the so-called IB1-IG algorithm (Daelemans and Van den Bosch, 1992) proved most successful. It computes the distance between two vectors X and Y by adding up the weighted distances between their corresponding feature values, as in equation (7):

$$\Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i) \qquad (7)$$

where δ(x_i, y_i) is 0 when the two feature values are identical and 1 when they differ.</Paragraph>
      <Paragraph position="4"> By default, TiMBL determines the weights w_i for each feature on the basis of the feature's Information Gain (the increase in information that knowledge of that feature's value brings with it) and the number of values that the feature can have. The precise equations are discussed in Daelemans et al. (2004) and need not concern us any further here.</Paragraph>
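A compact sketch of this algorithm (using plain Information Gain, without TiMBL's correction for the number of values a feature can have; all names and data layouts are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(X, y, f):
    """Entropy reduction obtained by knowing the value of feature f."""
    partitions = {}
    for row, label in zip(X, y):
        partitions.setdefault(row[f], []).append(label)
    remainder = sum(len(part) / len(y) * entropy(part)
                    for part in partitions.values())
    return entropy(y) - remainder

def distance(x, z, weights):
    """Equation (7): sum of the feature weights where the values differ."""
    return sum(w for xi, zi, w in zip(x, z, weights) if xi != zi)

def classify(test, X, y, weights):
    """Label a test vector with the label of its nearest training vector."""
    nearest = min(range(len(X)), key=lambda i: distance(test, X[i], weights))
    return y[nearest]

# weights = [information_gain(X, y, f) for f in range(len(X[0]))]
```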
    </Section>
    <Section position="2" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
3.2 Experiments
</SectionTitle>
      <Paragraph position="0"> I again applied this IB1-IG algorithm to Markert and Nissim's (2002b) location corpora. In order to make my results as comparable as possible to Markert and Nissim's (2002a) and Nissim and Markert's (2003), I made two changes in the evaluation process. First, evaluation was now performed with 10-fold cross-validation. Second, in the calculation of accuracy, I made a distinction between the several metonymical labels, so that a misclassification within the metonymical category was penalized as well.</Paragraph>
      <Paragraph position="1"> I conducted two rounds of experiments. The first used only grammatical features: the grammatical function of the word (subj, obj, iobj, pp, gen, premod, passive subj, other), its head, the presence of a second head, and the second head (if present). Such features can be expected to identify metonymies with a high precision, but since metonymies may have a wide variety of heads, performance will likely suffer from data sparseness (Nissim and Markert, 2003). I therefore conducted a second round of experiments, in which I added semantic information to the feature sets, in the form of the WordNet hypernym synsets of the head's first sense.</Paragraph>
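To make this feature set concrete, a few hypothetical instances (invented for illustration, not drawn from the corpora) might look as follows:

```python
# (grammatical function, head, has_second_head, second_head) -> reading
instances = [
    (("subj", "say",   "no", "-"), "metonymical"),  # "Hungary said ..."
    (("pp",   "in",    "no", "-"), "literal"),      # "... lives in Hungary"
    (("obj",  "visit", "no", "-"), "literal"),      # "... visited Hungary"
]
```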
      <Paragraph position="2"> WordNet is a machine-readable lexical database that, among other things, structures English verbs, nouns and adjectives in a hierarchy of so-called "synonym sets" or synsets (Fellbaum, 1998). Each word belongs to such a group of synonyms, and each synset "is related to its immediately more general and more specific synsets via direct hypernym and hyponym relations" (Jurafsky and Martin, 2000, p.605). Fear, for instance, belongs to the synset fear, fearfulness, fright, which has emotion as its most immediate, and psychological feature as its highest hypernym. This tree structure of synsets thus corresponds to a hierarchy of semantic classes that can be used to add semantic knowledge to a metonymy recognition system.</Paragraph>

[Legend to tables 4 and 5: TiMBL = TiMBL's results; N&M = Nissim and Markert's (2003) results.]
      <Paragraph position="3"> My experiments investigated a few constellations of semantic features. The simplest of these used the highest hypernym synset of the head's first sense as an extra feature. A second approach added to the feature vector the head's highest hypernym synsets, with a maximum of ten. If the head did not have ten hypernyms, its own synset filled the remaining features. The result of this last approach is that the MBL classifier first looks for heads within the same synset as the test head. If it does not find a word that shares all hypernyms with the test instance, it gradually climbs the synset hierarchy until it finds the training instances that share as many hypernyms as possible. Obviously, this approach is able to make more fine-grained semantic distinctions than the previous one.</Paragraph>
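A sketch of this second constellation with NLTK's WordNet interface (an assumed stand-in for the WordNet access used in the paper; requires the WordNet data via nltk.download('wordnet')):

```python
from nltk.corpus import wordnet as wn

def hypernym_features(head, slots=10):
    """Fixed-length vector of the head's highest hypernym synsets,
    padded with the head's own synset when the chain is shorter."""
    synsets = wn.synsets(head)
    if not synsets:
        return ["NONE"] * slots
    sense = synsets[0]                    # first WordNet sense, as in the text
    chain = [s.name() for s in sense.hypernym_paths()[0]]   # root ... sense
    vec = chain[:slots]                   # highest hypernyms first
    vec += [sense.name()] * (slots - len(vec))   # pad with the word's own synset
    return vec

print(hypernym_features("fear"))
# e.g. ['entity.n.01', ..., 'fear.n.01', 'fear.n.01', ...]
```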
    </Section>
    <Section position="3" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
3.3 Experimental results
</SectionTitle>
      <Paragraph position="0"> The experiments with grammatical information showed that TiMBL is able to replicate Nissim and Markert's (2003) results. The obtained accuracy and F-scores for the mixed country names in table 4 are almost identical to Nissim and Markert's figures. The results for the Hungary data in table 5 lie slightly lower, but again mirror Nissim and Markert's figures closely (Katja Markert, personal communication). This is all the more promising since my results were reached without any semantic information. Remember that Nissim and Markert's algorithm, in contrast, used Dekang Lin's (1998) clusters of semantically similar words in order to deal with data sparseness.</Paragraph>
      <Paragraph position="1"> Memory-Based Learning does not appear to need this semantic information to arrive at state-of-the-art performance. Instead, it tackles possible data sparseness by its automatic back-off to the grammatical role if the target's head is not found among the training data.</Paragraph>
      <Paragraph position="2"> Of course, the grammatical role of a target word is often not sufficient to determine its literal or metonymical status. My second round of experiments therefore investigated whether performance can still be improved by the addition of semantic information. This does not appear to be the case. Although F-scores for the metonymical category tended to increase slightly (as a result of higher recall values), the system's accuracy hardly changed. In order to check whether this was due to the automatic selection of the head's first WordNet sense, I manually disambiguated all heads in the data. This showed that the first WordNet sense was indeed often incorrect, but the selection of the correct sense did not improve performance. The reason why WordNet information fails to yield higher results must thus be found elsewhere.</Paragraph>
      <Paragraph position="3"> A first possible explanation is the mismatch between WordNet's synsets and our semantic labels.</Paragraph>
      <Paragraph position="4"> Many synsets cover such a wide variety of words that they allow for several readings of the target, while others are too specific to make generalization possible. A second possible explanation is the predominance of prepositional heads in the data, for which extra semantic information is useless.</Paragraph>
      <Paragraph position="5"> In short, the experiments above demonstrate convincingly that Memory-Based Learning is a simple but robust approach to metonymy recognition. This simplicity is a major asset, and stands in stark contrast to the competing approaches to metonymy recognition in the literature. It remains to be studied, however, whether other features can further increase the classifier's performance.</Paragraph>
      <Paragraph position="6"> Attachment information is one such feature that certainly deserves further attention.</Paragraph>
    </Section>
  </Section>
</Paper>