<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1008">
  <Title>Syntactic Features and Word Similarity for Supervised Metonymy Resolution</Title>
  <Section position="3" start_page="3" end_page="5" type="metho">
    <SectionTitle>
2 Corpus Study
</SectionTitle>
    <Paragraph position="0"> We summarize (Markert and Nissim, 2002b)'s annotation scheme for location names and present an annotated corpus of occurrences of country names.</Paragraph>
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
2.1 Annotation Scheme for Location Names
</SectionTitle>
      <Paragraph position="0"> We identify literal, metonymic,andmixed readings.</Paragraph>
      <Paragraph position="1"> The literal reading comprises a locative (5) and a political entity interpretation (6).</Paragraph>
      <Paragraph position="2">  (5) coral coast of Papua New Guinea (6) Britain's current account deficit  We distinguish the following metonymic patterns (see also (Lakoff and Johnson, 1980; Fass, 1997; Stern, 1931)). In a place-for-people pattern, a place stands for any persons/organisations associated with it, e.g., for sports teams in (2), (3), and (4), and for the government in (7).</Paragraph>
      <Paragraph position="3">  (7) a cardinal element in Iran's strategy when Iranian naval craft [...] bombarded [...] In a place-for-event pattern, a location name refers to an event that occurred there (e.g., using the word Vietnam for the Vietnam war). In a place-for-product pattern a place stands for a product manufactured there (e.g., the word Bordeaux referring to the local wine).</Paragraph>
      <Paragraph position="4"> The category othermet covers unconventional metonymies, as (1), and is only used if none of the other categories fits (Markert and Nissim, 2002b). We also found examples where two predicates are involved, each triggering a different reading. (8) they arrived in Nigeria, hitherto a leading critic of the South African regime In (8), both a literal (triggered by &amp;quot;arriving in&amp;quot;) and a place-for-peoplereading (triggered by &amp;quot;leading critic&amp;quot;) are invoked. We introduced the category mixed to deal with these cases.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
2.2 Annotation Results
</SectionTitle>
      <Paragraph position="0"> Using Gsearch (Corley et al., 2001), we randomly extracted 1000 occurrences of country names from the BNC, allowing any country name and its variants listed in the CIA factbook  As the explicit referent is often underspecified, we introduce place-for-people as a supertype category and we evaluate our system on supertype classification in this paper. In the annotation, we further specify the different groups of people referred to, whenever possible (Markert and Nissim, 2002b).</Paragraph>
      <Paragraph position="2"> 1998) to occur. Each country name is surrounded by three sentences of context.</Paragraph>
      <Paragraph position="3"> The 1000 examples of our corpus have been independently annotated by two computational linguists, who are the authors of this paper. The annotation can be considered reliable (Krippendorff, 1980) with 95% agreement and a kappa (Carletta, 1996) of .88.</Paragraph>
      <Paragraph position="4"> Our corpus for testing and training the algorithm includes only the examples which both annotators could agree on and which were not marked as noise (e.g. homonyms, as &amp;quot;Professor Greenland&amp;quot;), for a total of 925. Table 1 reports the reading distribution.  The corpus distribution confirms that metonymies that do not follow established metonymic patterns (othermet) are very rare. This seems to be the case for other kinds of metonymies, too (Verspoor, 1997). We can therefore reformulate metonymy resolution as a classification task between the literal reading and a fixed set of metonymic patterns that can be identified in advance for particular semantic classes. This approach makes the task comparable to classic word sense disambiguation (WSD), which is also concerned with distinguishing between possible word senses/interpretations.</Paragraph>
      <Paragraph position="5"> However, whereas a classic (supervised) WSD algorithm is trained on a set of labelled instances of one particular word and assigns word senses to new test instances of the same word, (supervised) metonymy recognition can be trained on a set of labelled instances of different words of one semantic class and assign literal readings and metonymic patterns to new test instances of possibly different words of the same semantic class. This class-based approach enables one to, for example, infer the reading of (3) from that of (2).</Paragraph>
      <Paragraph position="6"> We use a decision list (DL) classifier. All features encountered in the training data are ranked in the DL (best evidence first) according to the following log-</Paragraph>
      <Paragraph position="8"> We estimated probabilities via maximum likelihood, adopting a simple smoothing method (Martinez and Agirre, 2000): 0.1 is added to both the denominator and numerator.</Paragraph>
      <Paragraph position="9"> The target readings to be distinguished are  literal,place-for-people,place-forevent, place-for-product,othermet and mixed. All our algorithms are tested on our annotated corpus, employing 10-fold cross-validation. We evaluate accuracy and coverage:  We also use a backing-off strategy to the most frequent reading (literal) for the cases where no decision can be made. We report the results as accuracy backoff (Acc b ); coverage backoff is always 1. We are also interested in the algorithm's performance in recognising non-literal readings. Therefore, we compute precision (P), recall (R), and F-measure (F), where A is the number of non-literal readings correctly identified as non-literal (true positives) and B the number of literal readings that are incorrectly identified as non-literal (false positives):</Paragraph>
      <Paragraph position="11"> The baseline used for comparison is the assignment of the most frequent reading literal.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="5" end_page="8" type="metho">
    <SectionTitle>
4 Context Reduction
</SectionTitle>
    <Paragraph position="0"> We show that reducing the context to head-modifier relations involving the Possibly Metonymic Word achieves high precision metonymy recognition.</Paragraph>
    <Paragraph position="1">  In (Markert and Nissim, 2002a), we also considered local and topical cooccurrences as contextual features. They constantly achieved lower precision than grammatical features.  role-of-head (r-of-h) example subj-of-win England won the World Cup (place-for-people) subjp-of-govern Britain has been governed by . . . (literal) dobj-of-visit the Apostle had visited Spain (literal) gen-of-strategy in Iran'sstrategy... (place-for-people) premod-of-veteran a Vietnam veteran from Rhode Island (place-for-event) ppmod-of-with its border with Hungary (literal)  role freq #non-lit subj 92 65 subjp 64 dobj 28 12 gen 93 20 premod 94 13 ppmod 522 57 other 90 17 total 925 188  We represent each example in our corpus by a single feature role-of-head, expressing the grammatical role of the PMW (limited to (active) subject, passive subject, direct object, modifier in a prenominal genitive, other nominal premodifier, dependent in a prepositional phrase) and its lemmatised lexical head within a dependency grammar framework.  Table 2 shows example values and Table 3 the role distribution in our corpus.</Paragraph>
    <Paragraph position="2"> We trained and tested our algorithm with this feature (hmr).</Paragraph>
    <Paragraph position="3">  Results for hmr are reported in the first line of Table 5. The reasonably high precision (74.5%) and accuracy (90.2%) indicate that reducing the context to a head-modifier feature does not cause loss of crucial information in most cases. Low recall is mainly due to low coverage (see Problem 2 below). We identified two main problems. Problem 1. The feature can be too simplistic, so that decisions based on the head-modifier relation can assign the wrong reading in the following cases: &amp;quot;Bad&amp;quot; heads: Some lexical heads are semantically empty, thus failing to provide strong evidence for any reading and lowering both recall and precision. Bad predictors are the verbs &amp;quot;to have&amp;quot; and &amp;quot;to be&amp;quot; and some prepositions such as &amp;quot;with&amp;quot;, which can be used with metonymic (talk with Hungary) and literal (border with Hungary) readings. This problem is more serious for function than for content word heads: precision on the set of subjects and objects is 81.8%, but only 73.3% on PPs.</Paragraph>
    <Paragraph position="4"> &amp;quot;Bad&amp;quot; relations: The premod relation suffers from noun-noun compound ambiguity. US op- null We consider only one link per PMW, although cases like (8) would benefit from including all links the PMW participates in.  The feature values were manually annotated for the following experiments, adapting the guidelines in (Poesio, 2000). The effect of automatic feature extraction is described in Section 6. eration can refer to an operation in the US (literal) or by the US (metonymic).</Paragraph>
    <Paragraph position="5"> Other cases: Very rarely neglecting the remaining context leads to errors, even for &amp;quot;good&amp;quot; lexical heads and relations. Inferring from the metonymy in (4) that &amp;quot;Germany&amp;quot; in &amp;quot;Germany lost a fifth of its territory&amp;quot; is also metonymic, e.g., is wrong and lowers precision.</Paragraph>
    <Paragraph position="6"> However, wrong assignments (based on head-modifier relations) do not constitute a major problem as accuracy is very high (90.2%).</Paragraph>
    <Paragraph position="7"> Problem 2. The algorithm is often unable to make any decision that is based on the head-modifier relation. This is by far the more frequent problem, which we adress in the remainder of the paper. The feature role-of-head accounts for the similarity between (2) and (3) only, as classification of a test instance with a particular feature value relies on having seen exactly the same feature value in the training data. Therefore, we have not tackled the inference from (2) or (3) to (4). This problem manifests itself in data sparseness and low recall and coverage, as many heads are encountered only once in the corpus. As hmr's coverage is only 63.1%, backoff to a literal reading is required in 36.9% of the cases.</Paragraph>
  </Section>
  <Section position="5" start_page="8" end_page="9" type="metho">
    <SectionTitle>
5 Generalising Context Similarity
</SectionTitle>
    <Paragraph position="0"> In order to draw the more complex inference from (2) or (3) to (4) we need to generalise context similarity. We relax the identity constraint of the original algorithm (the same role-of-head value of the test instance must be found in the DL), exploiting two similarity levels. Firstly, we allow to draw inferences over similar values of lexical heads (e.g. from subj-of-win to subj-of-lose), rather than over identical ones only. Secondly, we allow to discard the  lexical head and generalise over the PMW's grammatical role (e.g. subject). These generalisations allow us to double recall without sacrificing precision or increasing the size of the training set.</Paragraph>
    <Section position="1" start_page="8" end_page="9" type="sub_section">
      <SectionTitle>
5.1 Relaxing Lexical Heads
</SectionTitle>
      <Paragraph position="0"> We regard two feature values r-of-h and r-of-h</Paragraph>
      <Paragraph position="2"> are similar. In order to capture the similarity between h and h  we integrate a thesaurus (Lin, 1998) in our algorithm's testing phase. In Lin's thesaurus, similarity between words is determined by their distribution in dependency relations in a newswire corpus. For a content word h (e.g., &amp;quot;lose&amp;quot;) of a specific part-of-speech a set of similar words h of the same part-of-speech is given. The set members are ranked in decreasing order by a similarity score. Table 4 reports example entries.</Paragraph>
      <Paragraph position="3">  Our modified algorithm (relax I) is as follows:  1. train DL with role-of-head as in hmr; for each test instance observe the following procedure (r-of-h indicates the feature value of the test instance); 2. if r-of-h is found in the DL, apply the corresponding rule and stop;</Paragraph>
      <Paragraph position="5"> is found in the DL, apply corresponding rule and stop; if r-of-h</Paragraph>
      <Paragraph position="7"> is not found in the DL, increase i by 1 and go to (a); The examples already covered by hmr are classified in exactly the same way by relax I (see Step 2). Let us therefore assume we encounter the test instance (4), its feature value subj-of-lose has not been seen in the training data (so that Step 2 fails and Step 2  has to be applied) and subj-of-win is in the DL. For all n 1, relax I will use the rule for subj-of-win to assign a reading to &amp;quot;Scotland&amp;quot; in (4) as &amp;quot;win&amp;quot; is the most similar word to &amp;quot;lose&amp;quot; in the thesaurus (see Table 4). In this case (2b') is only  In the original thesaurus, each h is subdivided into clusters. We do not take these divisions into account.  applied once as already the first iteration over the thesaurus finds a word h  to 41% and precision increases from 74.5% in hmr to 80.2%, yielding an increase in F-measure from 29.8% to 54.2% (n =50). Coverage rises to 78.9% and accuracy backoff to 85.1% (Table 5). Whereas the increase in coverage and recall is quite intuitive, the high precision achieved by relax I requires further explanation. Let S be the set of examples that relax I covers. It consists of two subsets: S1 is the subset already covered by hmr and its treatment does not change in relax I, yielding the same precision. S2 is the set of examples that relax I covers in addition to hmr. The examples in S2 consist of cases with highly predictive content word heads as (a) function words are not included in the thesaurus and (b) unpredictive content word heads like &amp;quot;have&amp;quot; or &amp;quot;be&amp;quot; are very frequent and normally already covered by hmr (they are therefore members of S1). Precision on S2 is very high (84%) and raises the overall precision on the set S.</Paragraph>
      <Paragraph position="8"> Cases that relax I does not cover are mainly due  names or alternative spelling), (b) the small number of training instances for some grammatical roles (e.g. dobj), so that even after 50 thesaurus iterations no similar role-of-head value could be found that is covered in the DL, or (c) grammatical roles that are not covered (other in Table 3).</Paragraph>
    </Section>
    <Section position="2" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
5.2 Discarding Lexical Heads
</SectionTitle>
      <Paragraph position="0"> Another way of capturing the similarity between (3) and (4), or (7) and (9) is to ignore lexical heads and generalise over the grammatical role (role)ofthe PMW (with the feature values as in Table 3: subj, subjp, dobj, gen, premod, ppmod). We therefore developed the algorithm relax II.</Paragraph>
      <Paragraph position="1">  1. train decision lists: (a) DL1 with role-of-head as in hmr (b) DL2 with role; for each test instance observe the following procedure (rof-h and r are the feature values of the test instance); 2. if r-of-h is found in the DL1, apply the corresponding rule and stop; 2' otherwise,ifr is found in DL2, apply the corresponding rule.</Paragraph>
      <Paragraph position="2"> Let us assume we encounter the test instance (4), subj-of-lose is not in DL1 (so that Step 2 fails and Step 2  has to be applied) and subj is in DL2.</Paragraph>
      <Paragraph position="3"> The algorithm relax II will assign a place-for-people reading to &amp;quot;Scotland&amp;quot;, as most subjects in our corpus are metonymic (see Table 3).</Paragraph>
      <Paragraph position="4"> Generalising over the grammatical role outperforms hmr, achieving 81.3% precision, 44.1% recall, and 57.2% F-measure (see Table 5). The algorithm relax II also yields fewer false negatives than relax I (and therefore higher recall) since all subjects not covered in DL1 are assigned a metonymic reading, which is not true for relax I.</Paragraph>
    </Section>
    <Section position="3" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
5.3 Combining Generalisations
</SectionTitle>
      <Paragraph position="0"> There are several ways of combining the algorithms we introduced. In our experiments, the most successful one exploits the facts that relax II performs better than relax I on subjects and that relax I performs better on the other roles. Therefore the algorithm combination uses relax II if the test instance is a subject, and relax I otherwise. This yields the best results so far, with 87% accuracy backoff and 62.7% F-measure (Table 5).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="9" end_page="10" type="metho">
    <SectionTitle>
6 Influence of Parsing
</SectionTitle>
    <Paragraph position="0"> The results obtained by training and testing our classifier with manually annotated grammatical relations are the upper bound of what can be achieved by using these features. To evaluate the influence parsing has on the results, we used the RASP toolkit (Briscoe and Carroll, 2002) that includes a pipeline of tokenisation, tagging and state-of-the-art statistical parsing, allowing multiple word tags. The toolkit also maps parse trees to representations of grammatical relations, which we in turn could map in a straightforward way to our role categories.</Paragraph>
    <Paragraph position="1"> RASP produces at least partial parses for 96% of our examples. However, some of these parses do not assign any role of our roleset to the PMW -only 76.9% of the PMWs are assigned such a role by RASP (in contrast to 90.2% in the manual annotation; see Table 3). RASP recognises PMW subjects with 79% precision and 81% recall. For PMW direct objects, precision is 60% and recall 86%.</Paragraph>
    <Paragraph position="2">  We reproduced all experiments using the automatically extracted relations. Although the relative performance of the algorithms remains mostly unchanged, most of the resulting F-measures are more than 10% lower than for hand annotated roles (Table 6). This is in line with results in (Gildea and Palmer, 2002), who compare the effect of manual and automatic parsing on semantic predicate-argument recognition.</Paragraph>
  </Section>
  <Section position="7" start_page="10" end_page="13" type="metho">
    <SectionTitle>
7 Related Work
</SectionTitle>
    <Paragraph position="0"> Previous Approaches to Metonymy Recognition.</Paragraph>
    <Paragraph position="1"> Our approach is the first machine learning algorithm to metonymy recognition, building on our previous  We did not evaluate RASP's performance on relations that do not involve the PMW.</Paragraph>
    <Paragraph position="2"> Table 6: Results summary for the different algorithms using RASP. For relax I and combination we report best results (50 thesaurus iterations).  work (Markert and Nissim, 2002a). The current approach expands on it by including a larger number of grammatical relations, thesaurus integration, and an assessment of the influence of parsing. Best F-measure for manual annotated roles increased from 46.7% to 62.7% on the same dataset.</Paragraph>
    <Paragraph position="3"> Most other traditional approaches rely on hand-crafted knowledge bases or lexica and use violations of hand-modelled selectional restrictions (plus sometimes syntactic violations) for metonymy recognition (Pustejovsky, 1995; Hobbs et al., 1993; Fass, 1997; Copestake and Briscoe, 1995; Stallard, 1993).</Paragraph>
    <Paragraph position="4">  In these approaches, selectional restrictions (SRs) are not seen as preferences but as absolute constraints. If and only if such an absolute constraint is violated, a non-literal reading is proposed. Our system, instead, does not have any a priori knowledge of semantic predicate-argument restrictions. Rather, it refers to previously seen training examples in head-modifier relations and their labelled senses and computes the likelihood of each sense using this distribution. This is an advantage as our algorithm also resolved metonymies without SR violations in our experiments. An empirical comparison between our approach in (Markert and Nissim, 2002a)  and an SRs violation approach showed that our approach performed better.</Paragraph>
    <Paragraph position="5"> In contrast to previous approaches (Fass, 1997; Hobbs et al., 1993; Copestake and Briscoe, 1995; Pustejovsky, 1995; Verspoor, 1996; Markert and Hahn, 2002; Harabagiu, 1998; Stallard, 1993), we use a corpus reliably annotated for metonymy for evaluation, moving the field towards more objective  (Markert and Hahn, 2002) and (Harabagiu, 1998) enhance this with anaphoric information. (Briscoe and Copestake, 1999) propose using frequency information besides syntactic/semantic restrictions, but use only a priori sense frequencies without contextual features.</Paragraph>
    <Paragraph position="6">  Note that our current approach even outperforms (Markert and Nissim, 2002a).</Paragraph>
    <Paragraph position="7"> evaluation procedures.</Paragraph>
    <Paragraph position="8"> Word Sense Disambiguation. We compared our approach to supervised WSD in Section 3, stressing word-to-word vs. class-to-class inference. This allows for a level of abstraction not present in standard supervised WSD. We can infer readings for words that have not been seen in the training data before, allow an easy treatment of rare words that undergo regular sense alternations and do not have to annotate and train separately for every individual word to treat regular sense distinctions.</Paragraph>
    <Paragraph position="9">  By exploiting additional similarity levels and integrating a thesaurus we further generalise the kind of inferences we can make and limit the size of annotated training data: as our sampling frame contains 553 different names, an annotated data set of 925 samples is quite small. These generalisations over context and collocates are also applicable to standard WSD and can supplement those achieved e.g., by subcategorisation frames (Martinez et al., 2002). Our approach to word similarity to overcome data sparseness is perhaps most similar to (Karov and Edelman, 1998). However, they mainly focus on the computation of similarity measures from the training data. We instead use an off-the-shelf resource without adding much computational complexity and achieve a considerable improvement in our results.</Paragraph>
  </Section>
class="xml-element"></Paper>