<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1042">
  <Title>A Clustering Approach for the Nearly Unsupervised Recognition of Nonliteral Language</Title>
  <Section position="4" start_page="329" end_page="332" type="metho">
    <SectionTitle>
3 TroFi
</SectionTitle>
    <Paragraph position="0"> TroFiis not a metaphor processing system. It does not claim to interpret metonymy and it will not tell you what a given idiom means. Rather, TroFi attemptstoseparate literalusages ofverbsfromnonliteral ones.</Paragraph>
    <Paragraph position="1"> For the purposes of this paper we will take the simplified view that literal is anything that falls within accepted selectional restrictions (&amp;quot;he was forced to eat his spinach&amp;quot; vs. &amp;quot;he was forced to eat his words&amp;quot;) or our knowledge of the world (&amp;quot;the sponge absorbed the water&amp;quot; vs. &amp;quot;the company absorbed the loss&amp;quot;). Nonliteral is then anything that is &amp;quot;not literal&amp;quot;, including most tropes, such as metaphors, idioms, as well phrasal verbs and other anomalous expressions that cannot really be seen as literal. In terms of metonymy, TroFi may cluster a verb used in a metonymic expression such as &amp;quot;IreadKeats&amp;quot; asnonliteral, but wemakenostrong claims about this.</Paragraph>
    <Section position="1" start_page="329" end_page="330" type="sub_section">
      <SectionTitle>
3.1 The Data
</SectionTitle>
      <Paragraph position="0"> The TroFi algorithm requires a target set (called original set in (Karov &amp; Edelman, 1998)) - the set of sentences containing the verbs to be classified into literal or nonliteral - and the seed sets: the literal feedback set and the nonliteral feed-back set. These sets contain feature lists consisting of the stemmed nouns and verbs in a sentence, with target or seed words and frequent words removed. The frequent word list (374 words) consists of the 332 most frequent words in the British National Corpus plus contractions, single letters, and numbers from 0-10. The target set is built using the '88-'89 Wall Street Journal Corpus (WSJ) tagged using the (Ratnaparkhi, 1996) tagger and the (Bangalore &amp; Joshi, 1999) SuperTagger; the feedback sets are built using WSJ sentences con- null Algorithm 1 KE-train: (Karov &amp; Edelman, 1998) algorithm adapted to literal/nonliteral classification  Require: S: the set of sentences containing the target word Require: L: the set of literal seed sentences Require: N: the set of nonliteral seed sentences Require: W: the set of words/features, w [?] s means w is in sentence s, s owner w means s contains w Require: epsilon1: threshold that determines the stopping condition 1: w-sim0(wx,wy) := 1 if wx = wy,0 otherwise 2: s-simI0(sx,sy) := 1, for all sx,sy [?] S xS where sx = sy, 0 otherwise 3: i := 0 4: while (true) do 5: s-simLi+1(sx,sy) := summationtextwx[?]sx p(wx,sx)maxwy[?]sy w-simi(wx,wy), for all sx,sy [?] S xL 6: s-simNi+1(sx,sy) := summationtextwx[?]sx p(wx,sx)maxwy[?]sy w-simi(wx,wy), for all sx,sy [?] S xN 7: for wx,wy [?] W xW do</Paragraph>
      <Paragraph position="2"> else summationtextsxownerwx p(wx,sx)maxsyownerwy{s-simLi (sx,sy),s-simNi (sx,sy)} 9: end for 10: if [?]wx,maxwy{w-simi+1(wx,wy)[?]w-simi(wx,wy)} [?] epsilon1 then 11: break # algorithm converges in 1epsilon1 steps.</Paragraph>
      <Paragraph position="3"> 12: end if 13: i := i + 1 14: end while  taining seed words extracted from WordNet and the databases of known metaphors, idioms, and expressions (DoKMIE), namely Wayne Magnuson English Idioms Sayings &amp; Slang and George Lakoff's Conceptual Metaphor List, as well as example sentences from these sources. (See Section 4forthesizes ofthetarget and feedback sets.) One may ask why we need TroFi if we have databases like the DoKMIE. The reason is that the DoKMIE are unlikely to list all possible instances of non-literal language and because knowing that an expression can be used nonliterally does not mean that you can tell when it is being used nonliterally. The target verbs may not, and typically do not, appear in the feedback sets. In addition, the feedback sets are noisy and not annotated by any human, which is why we call TroFi unsupervised.  WhenweuseWordNetasasourceofexamplesentences, or of seed words for pulling sentences out of the WSJ, for building the literal feedback set, we cannot tell if the WordNet synsets, or the collected feature sets, are actually literal. We provide some automatic methods in Section 3.3 to ensure that the feedback set feature sets that will harm us in the clustering phase are removed. As a sideeffect, we may fill out sparse nonliteral sets. In the next section we look at the Core TroFi algorithm and its use of the above data sources.</Paragraph>
    </Section>
    <Section position="2" start_page="330" end_page="331" type="sub_section">
      <SectionTitle>
3.2 Core Algorithm
</SectionTitle>
      <Paragraph position="0"> Since we are attempting to reduce the problem of literal/nonliteral recognition to one of word-sense disambiguation, TroFi makes use of an existing similarity-based word-sense disambiguation algorithm developed by (Karov &amp; Edelman, 1998), henceforth KE.</Paragraph>
      <Paragraph position="1"> The KE algorithm is based on the principle of attraction: similarities are calculated between sentences containing the word we wish to disambiguate (the target word) and collections of seed sentences (feedback sets) (see also Section 3.1).</Paragraph>
      <Paragraph position="2"> A target set sentence is considered to be attracted to the feedback set containing the sentence to which it shows the highest similarity. Two sentences aresimilar ifthey contain similar wordsand two words are similar if they are contained in similar sentences. The resulting transitive similarity allows us to defeat the knowledge acquisition bottleneck - i.e. the low likelihood of finding all possible usages of a word in a single corpus. Note that the KE algorithm concentrates on similarities in the way sentences use the target literal or non-literal word, not on similarities in the meanings of the sentences themselves.</Paragraph>
      <Paragraph position="3"> Algorithms 1 and 2 summarize the basic TroFi version of the KE algorithm. Note that p(w,s) is the unigram probability of word w in sentence s,  Algorithm 2 KE-test: classifying literal/nonliteral  1: For any sentence sx [?] S 2: if maxsy s-simL(sx,sy) &gt; maxsy s-simN(sx,sy) then 3: tag sx as literal 4: else 5: tag sx as nonliteral 6: end if normalized by the total number of words in s.  In practice, initializing s-simI0 in line (2) of Algorithm 1 to 0 and then updating it from w-sim0 means that each target sentence is still maximally similar to itself, but we also discover additional similarities between target sentences. We further enhance the algorithm by using Sum of Similarities. To implement this, in Algorithm 2 we change line (2) into:summationtext</Paragraph>
      <Paragraph position="5"> Although it is appropriate for fine-grained tasks like word-sense disambiguation to use the single highest similarity scoreinordertominimizenoise, summing across all the similarities of a target set sentence to the feedback set sentences is more appropriate for literal/nonliteral clustering, where the usages could be spread across numerous sentences in the feedback sets. We make another modification to Algorithm 2 by checking that the maximum sentence similarity in line (2) is above a certain threshold for classification. If the similarity is above this threshold, we label a target-word sentence as literal or nonliteral.</Paragraph>
      <Paragraph position="6"> Before continuing, let us look at an example.</Paragraph>
      <Paragraph position="7"> The features are shown in bold.</Paragraph>
    </Section>
    <Section position="3" start_page="331" end_page="331" type="sub_section">
      <SectionTitle>
Target Set
</SectionTitle>
      <Paragraph position="0"> N2 This idea is risky, but it looks like the director of the institute has comprehended the basic principles behind it.</Paragraph>
      <Paragraph position="1"> N3 Mrs. Fipps is having trouble comprehending the legal straits of the institute.</Paragraph>
      <Paragraph position="2"> N4 She had a hand in his fully comprehending the quandary. The target set consists of sentences from the corpus containing the target word. The feedback sets contain sentences from the corpus containing synonyms of the target word found in WordNet (literal feedback set) and the DoKMIE (nonliteral feedback set). The feedback sets also contain example sentences provided in the target-word entriesofthese datasets. TroFiattempts tocluster the target set sentences into literal and nonliteral by attracting them to the corresponding feature sets using Algorithms 1 &amp; 2. Using the basic KE algorithm, target sentence 2 is correctly attracted to the nonliteral set, and sentences 1 and 3 are equally attracted to both sets. When we apply our sum of similarities enhancement, sentence 1 is correctly attracted to the literal set, but sentence 3 is now incorrectly attracted to the literal set too. In the following sections wedescribe some enhancements Learners &amp; Voting, SuperTags, and Context - that try to solve the problem of incorrect attractions.</Paragraph>
    </Section>
    <Section position="4" start_page="331" end_page="332" type="sub_section">
      <SectionTitle>
3.3 Cleaning the Feedback Sets
</SectionTitle>
      <Paragraph position="0"> In this section we describe how we clean up the feedback sets to improve the performance of the Core algorithm. We also introduce the notion of Learners &amp; Voting.</Paragraph>
      <Paragraph position="1"> Recallthat neither the raw data nor thecollected feedback sets are manually annotated for training purposes. Since, in addition, the feedback sets are collected automatically, they are very noisy. For instance, in the example in Section 3.2, the literal feedback set sentence L3 contains an idiom which was provided as an example sentence in WordNet as a synonym for &amp;quot;grasp&amp;quot;. In N4, we have the side-effect feature &amp;quot;hand&amp;quot;, which unfortunately overlaps with the feature &amp;quot;hand&amp;quot; that we might hope to find in the literal set (e.g. &amp;quot;grasp his hand&amp;quot;). In order to remove sources of false attraction like these, we introduce the notion of scrubbing. Scrubbing is founded on a few basic principles. The first is that the contents of the DoKMIE come from (third-party) human annotations and are thus trusted. Consequently we take them as primary and use them to scrub the WordNet synsets. The second is that phrasal and expression verbs, for example &amp;quot;throw away&amp;quot;, are often indicative of nonliteral uses of verbs - i.e. they are not the sum of their parts - so they can be used for scrubbing. The third is that content words appearing in both feedback sets - for example &amp;quot;the wind is blowing&amp;quot; vs. &amp;quot;the winds of war are blowing&amp;quot; for the target word &amp;quot;blow&amp;quot; - will lead to impure feedback sets, a situation we want to avoid.</Paragraph>
      <Paragraph position="2"> The fourth is that our scrubbing action can take a number of different forms: we can choose to scrub  just a word, a whole synset, or even an entire feature set. In addition, we can either move the offending item to the opposite feedback set or remove it altogether. Moving synsets or feature sets can add valuable content to one feedback set while removing noise from the other. However, it can also cause unforeseen contamination. We experimented with a number of these options to produce a whole complement of feedback set learners for classifying the target sentences. Ideally this will allow the different learners to correct each other.</Paragraph>
      <Paragraph position="3"> For Learner A, we use phrasal/expression verbs and overlap as indicators to select whole Word-Net synsets for moving over to the nonliteral feed-back set. In our example, this causes L1-L3 to be moved to the nonliteral set. For Learner B, we use phrasal/expression verbs and overlap as indicators to remove problematic synsets. Thus weavoid accidentally contaminating the nonliteral set. However, we do end up throwing away information that could have been used to pad out sparse nonliteral sets. In our example, this causes L1-L3 to be dropped. For Learner C, we remove feature sets from the final literal and nonliteral feedback sets based on overlapping words. In our example, this causes L2 and N4 to be dropped. Learner D is the baseline - no scrubbing. We simply use the basic algorithm. Each learner has benefits and shortcomings. In order to maximize the former and minimize the latter, instead of choosing the single most successful learner, we introduce a voting system. We use a simple majority-rules algorithm, with the strongest learners weighted more heavily. In our experiments we double the weights of Learners A and D. In our example, this results in sentence 3 now being correctly attracted to the nonliteral set.</Paragraph>
    </Section>
    <Section position="5" start_page="332" end_page="332" type="sub_section">
      <SectionTitle>
3.4 Additional Features
</SectionTitle>
      <Paragraph position="0"> Evenbefore voting, weattempttoimprovethecorrectness of initial attractions through the use of SuperTags, which allows us to add internal structure information to the bag-of-words feature lists.</Paragraph>
      <Paragraph position="1"> SuperTags (Bangalore &amp; Joshi, 1999) encode a great deal of syntactic information in a single tag (each tag is an elementary tree from the XTAG English Tree Adjoining Grammar). In addition to a word's part of speech, they also encode information about its location in a syntactic tree i.e. we learn something about the surrounding words as well. We devised a SuperTag trigram composed of the SuperTag of the target word and the following two words and their SuperTags if they contain nouns, prepositions, particles, or adverbs. This is helpful in cases where the same set of features can be used as part of both literal and nonliteral expressions. For example, turning &amp;quot;It's hard to kick a habit like drinking&amp;quot; into &amp;quot;habit drink kick/B nx0Vpls1 habit/A NXN,&amp;quot; results in a higher attraction to sentences about &amp;quot;kicking habits&amp;quot; than to sentences like &amp;quot;She has a habit of kicking me when she's been drinking.&amp;quot; Note that the creation of Learners A and B changes if SuperTags are used. In the original version, we only move or remove synsets basedonphrasal/expression verbsandoverlapping words. If SuperTags are used, we also move or remove feature sets whose SuperTag trigram indicates phrasal verbs (verb-particle expressions).</Paragraph>
      <Paragraph position="2"> A final enhancement involves extending the context to help with disambiguation. Sometimes critical disambiguation features are contained not in the sentence with the target word, but in an adjacent sentence. To add context, we simply group the sentence containing the target word with a specified number of surrounding sentences and turn the whole group into a single feature set.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="332" end_page="334" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> TroFi was evaluated on the 25 target words listed in Table 1. The target sets contain from 1 to 115 manually annotated sentences for each verb. The first round of annotations was done by the first annotator. The second annotator was given no instructions besides a few examples of literal and nonliteral usage (not covering all target verbs).</Paragraph>
    <Paragraph position="1"> The authors of this paper were the annotators. Our inter-annotator agreement on the annotations used as test data in the experiments in this paper is quite high. k (Cohen) and k (S&amp;C) on a random sample of 200 annotated examples annotated by two different annotators was found to be 0.77. As per ((Di Eugenio &amp; Glass, 2004), cf. refs therein), the standard assessment for k values is that tentative conclusions on agreement exists when .67 [?] k &lt; .8, and a definite conclusion on agreement exists when k [?] .8.</Paragraph>
    <Paragraph position="2"> In the case of a larger scale annotation effort, having the person leading the effort provide one or two examples of literal and nonliteral usages for each target verb to each annotator would almost certainly improve inter-annotator agreement.</Paragraph>
    <Paragraph position="3"> Table 1 lists the total number of target sentences, plus the manually evaluated literal and nonliteral  counts, for each target word. It also provides the feedback set sizes for each target word. The totals across all words are given at the bottom of the table.</Paragraph>
    <Paragraph position="4"> absorb assault die drag drown  The algorithms were evaluated based on how accurately they clustered the hand-annotated sentences. Sentences that were attracted to neither cluster or were equally attracted to both were put in the opposite set from their label, making a failure to cluster a sentence an incorrect clustering. Evaluation results were recorded as recall, precision, and f-score values. Literal recall is defined as (correct literals in literal cluster / total correct literals). Literal precision is defined as (correct literals in literal cluster / size of literal cluster). If there are no literals, literal recall is 100%; literal precision is 100% if there are no nonliterals in the literal cluster and 0% otherwise. The f-score is defined as (2 * precision * recall) / (precision + recall). Nonliteral precision and recall are defined similarly. Average precision is the average of literal and nonliteral precision; similarly for average recall. For overall performance, we take the f-score of average precision and average recall.</Paragraph>
    <Paragraph position="5"> We calculated two baselines for each word. The first was a simple majority-rules baseline. Due to the imbalance of literal and nonliteral examples, this baseline ranges from 60.9% to 66.7% with an average of 63.6%. Keep in mind though that using this baseline, the f-score for the nonliteral set will always be 0%. We come back to this point at the end of this section. We calculated a second baseline using a simple attraction algorithm. Each target set sentence is attracted to the feed-back set containing the sentence with which it has the mostwords incommon. This corresponds well to the basic highest similarity TroFi algorithm.</Paragraph>
    <Paragraph position="6"> Sentences attracted to neither, or equally to both, sets are put in the opposite cluster to where they belong. Since this baseline actually attempts to distinguish between literal and nonliteral and uses all the data used by the TroFi algorithm, it is the one we will refer to in our discussion below.</Paragraph>
    <Paragraph position="7"> Experiments were conducted to first find the results of the core algorithm and then determine the effects of each enhancement. The results are shown in Figure 1. The last column in the graph shows the average across all the target verbs.</Paragraph>
    <Paragraph position="8"> On average, the basic TroFi algorithm (KE) gives a 7.6% improvement over the baseline, with some words, like &amp;quot;lend&amp;quot; and &amp;quot;touch&amp;quot;, having higher results due to transitivity of similarity. For our sum of similarities enhancement, all the individual target word results except for &amp;quot;examine&amp;quot; sit above the baseline. The dip is due to the fact that while TroFican generate some beneficial similarities between words related by context, it can also generate some detrimental ones. When we use sum of similarities, it is possible for the transitively discovered indirect similarities between a target nonliteral sentence and all the sentences in a feedback set to add up to more than a single direct similarity between the target sentence and a single feedback set sentence. This is not possible with highest similarity because a single sentence would have to show a higher similarity to the target sentence than that produced by sharing an identical word, which is unlikely since transitively discovered similarities generally do not add up to 1. So, although highest similarity occasionally produces better results than using sum of similarities, on average we can expect to get better results with the latter. In this experiment alone, we get an average f-score of 46.3% for the sum of similarities results - a 9.4% improvement over the high similarity results (36.9%) and a 16.9% improvement over the baseline (29.4%).</Paragraph>
    <Paragraph position="9">  In comparing the individual results of all our learners, we found that the results for Learners A and D (46.7% and 46.3%) eclipsed Learners B and C by just over 2.5%. Using majority-rules voting with Learners A and D doubled, we were able to obtain an average f-score of 48.4%, showing that voting does to an extent balance out the learners' varying results on different words.</Paragraph>
    <Paragraph position="10"> The addition of SuperTags caused improvements in some words like &amp;quot;drag&amp;quot; and &amp;quot;stick&amp;quot;. The overall gain was only 0.5%, likely due to an over-generation of similarities. Future work may identify ways to use SuperTags more effectively.</Paragraph>
    <Paragraph position="11"> The use of additional context was responsible for our second largest leap in performance after sum of similarities. We gained 4.9%, bringing us to an average f-score of 53.8%. Worth noting is that the target words exhibiting the most significant improvement, &amp;quot;drown&amp;quot; and &amp;quot;grasp&amp;quot;, had some of the smallest target and feedback set feature sets, supporting the theory that adding cogent features may improve performance.</Paragraph>
    <Paragraph position="12"> With an average of 53.8%, all words but one lie well above our simple-attraction baseline, and some even achieve much higher results than the majority-rules baseline. Note also that, using this latter baseline, TroFi boosts the nonliteral f-score from 0% to 42.3%.</Paragraph>
  </Section>
  <Section position="6" start_page="334" end_page="335" type="metho">
    <SectionTitle>
5 The TroFi Example Base
</SectionTitle>
    <Paragraph position="0"> Inthis section wediscuss the TroFiExample Base.</Paragraph>
    <Paragraph position="1"> First, we examine iterative augmentation. Then we discuss the structure and contents of the example base and the potential for expansion.</Paragraph>
    <Paragraph position="2"> After an initial run for a particular target word, we have the cluster results plus a record of the feedback sets augmented with the newly clustered sentences. Each feedback set sentence is saved with a classifier weight, with newly clustered sentences receiving a weight of 1.0. Subsequent runs may be done to augment the initial clusters. For these runs, we use the classifiers from our initial run as feedback sets. New sentences for clustering are treated like a regular target set. Running TroFi produces new clusters and re-weighted classifiers augmented with newly clustered sentences. There can be as many runs as desired; hence iterative augmentation.</Paragraph>
    <Paragraph position="3"> We used the iterative augmentation process to build a small example base consisting of the target words from Table 1, as well as another 25 words drawn from the examples of scholars whose work  was reviewed in Section 2. It is important to note that in building the example base, we used TroFi with an Active Learning component (see (Birke, 2005)) which improved our average f-score from 53.8% to 64.9% on the original 25 target words.</Paragraph>
    <Paragraph position="4"> An excerpt from the example base is shown in Figure 2. Each entry includes an ID number and a Nonliteral, Literal, or Unannotated tag. Annotations are from testing or from active learning during example-base construction. The TroFi Example Base is available at http://www.cs.sfu.ca/~anoop/students/jbirke/. Further unsupervised expansion of the existing clusters as well as the production of additional clusters is a possibility.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML