<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1670">
  <Title>Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger</Title>
  <Section position="5" start_page="594" end_page="594" type="metho">
    <SectionTitle>
Table 1 (verbs): SUPERSENSE / VERBS OF
</SectionTitle>
    <Paragraph position="0"> body grooming, dressing and bodily care emotion feeling change size, temperature change, intensifying motion walking, flying, swimming cognition thinking, judging, analyzing, doubting perception seeing, hearing, feeling communication telling, asking, ordering, singing possession buying, selling, owning competition fighting, athletic activities social political and social activities and events consumption eating and drinking stative being, having, spatial relations contact touching, hitting, tying, digging weather raining, snowing, thawing, thundering creation sewing, baking, painting, performing  coverage information extraction and word sense disambiguation. Our goal is to simplify the disambiguation task, for both nouns and verbs, to a level at which it can be approached as any other tagging problem, and can be solved with state of the art methods. As a by-product, this task includes and extends NER. We define a tagset based on Wordnet's lexicographers classes, or supersenses (Ciaramita and Johnson, 2003), cf. Table 1. The size of the supersense tagset allows us to adopt a structured learning approach, which takes local dependencies between labels into account. To this extent, we cast the supersense tagging problem as a sequence labeling task and train a discriminative Hidden Markov Model (HMM), based on that of Collins (2002), on the manually annotated Semcor corpus (Miller et al., 1993). In two experiments we evaluate the accuracy of the tagger on the Semcor corpus itself, and on the English &amp;quot;all words&amp;quot; Senseval 3 shared task data (Snyder and Palmer, 2004). The model outperforms remarkably the best known baseline, the first sense heuristic - to the best of our knowledge, for the first time on the most realistic &amp;quot;all words&amp;quot; evaluation setting.</Paragraph>
    <Paragraph position="1"> The paper is organized as follows. Section 2 introduces the tagset, Section 3 discusses related work and Section 4 the learning model. Section 5 reports on experimental settings and results. In Section 6 we summarize our contribution and consider directions for further research.</Paragraph>
  </Section>
  <Section position="6" start_page="594" end_page="595" type="metho">
    <SectionTitle>
2 Supersense tagset
</SectionTitle>
    <Paragraph position="0"> Wordnet (Fellbaum, 1998) is a broad-coverage machine-readable dictionary which includes 11,306 verbs mapped to 13,508 word senses, called synsets, and 114,648 common and proper nouns mapped to 79,689 synsets. Each noun or verb synset is associated with one of 41 broad semantic categories, in order to organize the lexicographer's work of updating and managing the lexicon (see Table 1). Since each lexicographer category groups together many synsets they have been also called supersenses (Ciaramita and Johnson, 2003). There are 26 supersenses for nouns, 15 for verbs. This coarse-grained ontology has a number of attractive features, for the purpose of natural language processing. First, the small size of the set makes it possible to build a single tagger which has positive consequences on robustness. Second, classes, although fairly general, are easily recognizable and not too abstract or vague. More importantly, similar word senses tend to be merged together.</Paragraph>
    <Paragraph position="1"> As an example, Table 2 summarizes all senses of the noun &amp;quot;box&amp;quot;. The 10 synsets are mapped to 6 supersenses: &amp;quot;artifact&amp;quot;, &amp;quot;quantity&amp;quot;, &amp;quot;shape&amp;quot;, &amp;quot;state&amp;quot;, &amp;quot;plant&amp;quot;, and &amp;quot;act&amp;quot;. Three similar senses (2), (7) and (9), and the probably related (8), are merged in the &amp;quot;artifact&amp;quot; supersense. This process can help disambiguation because it removes sub- null 1. {box} (container) &amp;quot;he rummaged through a box of spare parts&amp;quot; - n.artifact 2. {box, loge} (private area in a theater or grandstand  where a small group can watch the performance) &amp;quot;the  royal box was empty&amp;quot; - n.artifact 3. {box, boxful} (the quantity contained in a box) &amp;quot;he gave her a box of chocolates&amp;quot; - n.quantity 4. {corner, box} (a predicament from which a skillful or graceful escape is impossible) &amp;quot;his lying got him into a tight corner&amp;quot; - n.state 5. {box}(arectangulardrawing)&amp;quot;theflowchartcontained many boxes&amp;quot; - n.shape 6. {box, boxwood} (evergreen shrubs or small trees) n.plant null 7. {box} (any one of several designated areas on a ball field where the batter or catcher or coaches are positioned) &amp;quot;the umpire warned the batter to stay in the batter's box&amp;quot; - n.artifact 8. {box, boxseat}(thedriver'sseatonacoach)&amp;quot;anarmed guard sat in the box with the driver&amp;quot; - n.artifact 9. {box} (separate partitioned area in a public place for a few people) &amp;quot;the sentry stayed in his box to avoid the cold&amp;quot; - n.artifact 10. {box} (a blow with the hand (usually on the ear)) &amp;quot;I gave him a good box on the ear&amp;quot; - n.act  synset, the set of synonyms, a definition, an optional example sentence, and the supersense label.</Paragraph>
    <Paragraph position="2"> tle distinctions, which are hard to discriminate and increase the size of the class space. One possible drawback is that senses which one might want to keep separate, e.g., the most common sense box/container (1), can be collapsed with others.</Paragraph>
    <Paragraph position="3"> One might argue that all &amp;quot;artifact&amp;quot; senses share semantic properties which differentiate them from the other senses and can support useful semantic inferences. Unfortunately, there are no general solutions to the problem of sense granularity. However, major senses identified by Wordnet are maintained at the supersense level. Hence, supersensedisambiguated words are also, at least partially, synset-disambiguated.</Paragraph>
    <Paragraph position="4"> Since Wordnet includes both proper and common nouns, the new tagset suggests an extended notion of named entity. As well as the usual NER categories, &amp;quot;person&amp;quot;, &amp;quot;group&amp;quot;, &amp;quot;location&amp;quot;, and &amp;quot;time&amp;quot;2, supersenses include categories such as artifacts, which can be fairly frequent, but usually neglected. To a greater extent than in standard NER, research in Bio-NER has focused on the adoption of richer ontologies for information extraction. Genia (Ohta et al., 2002), for example, is an ontology of 46 classes - with annotated 2The supersense category &amp;quot;group&amp;quot; is rather a superordinate of &amp;quot;organization&amp;quot; and has wider scope. corpus - designed for supporting information extraction in the molecular biology domain. In addition, there is growing interest for extracting relations between entities, as a more useful type of IE (cf. (Rosario and Hearst, 2004)).</Paragraph>
    <Paragraph position="5"> Supersense tagging is inspired by similar considerations, but in a domain-independent setting; e.g., verb supersenses can label semantic interactions between nominal concepts. The following sentence (Example 1), extracted from the data further described in Section 5.1 - shows the information captured by the supersense tagset:  (1) Clara Harrisn.person, one of the guestsn.person in the boxn.artifact, stood upv.motion and demandedv.communication  watern.substance.</Paragraph>
    <Paragraph position="6"> As Example 1 shows there is more information that can be extracted from a sentence than just the names; e.g. the fact that &amp;quot;Clara Harris&amp;quot; and the following &amp;quot;guests&amp;quot; are both tagged as &amp;quot;person&amp;quot; might suggest some sort of co-referentiality, while the coordination of verbs of motion and communication, as in &amp;quot;stood up and demanded&amp;quot;, might be useful for language modeling purposes. In such a setting, structured learning methods, e.g., sequential, can help tagging by taking the senses of the neighboring words into account.</Paragraph>
  </Section>
  <Section position="7" start_page="595" end_page="596" type="metho">
    <SectionTitle>
3 Related Work
</SectionTitle>
    <Paragraph position="0"> Sequential models are common in NER, POS tagging, shallow parsing, etc.. Most of the work in WSD, instead, has focused on labeling each word individually, possibly revising the assignments of senses at the document level; e.g., following the &amp;quot;one sense per discourse&amp;quot; hypothesis (Gale et al., 1992). Although it seems reasonable to assume that occurrences of word senses in a sentence can be correlated, hence that structured learning methods could be successful, there has not been much work on sequential WSD. Segond et al. (1997) are possibly the first to have applied an HMM tagger to semantic disambiguation. Interestingly, to make the method more tractable, they also used the supersense tagset and estimated the model on Semcor. By cross-validation they show a marked improvement over the first sense baseline. However, in (Segond et al., 1997) the tagset is used differently, by defining equivalence classes of words with the same set of senses. From a similar perspective, de Loupy et al. (de Loupy et al., 1998)  also investigated the potential advantages of using HMMs for disambiguation. More recently, variants of the generative HMM have been applied to WSD (Molina et al., 2002; Molina et al., 2004) and evaluated also on Senseval data, showing performance comparable to the first sense baseline.</Paragraph>
    <Paragraph position="1"> Previous work on prediction at the supersense level(CiaramitaandJohnson, 2003; Curran, 2005) has focused on lexical acquisition (nouns exclusively), thus aiming at word type classification rather than tagging. As far as applications are concerned, it has been shown that supersense information can support supervised WSD, by providing a partial disambiguation step (Ciaramita et al., 2003). In syntactic parse re-ranking supersenses have been used to build useful latent semantic features (Koo and Collins, 2005). We believe that supersense tagging has the potential to be useful, in combination with other sources of information such as part of speech, domain-specific NER models, chunking or shallow parsing, in tasks such as question answering and information extraction and retrieval, where large amounts of text need to be processed. It is also possible that this kind of shallow semantic information can help building more sophisticated linguistic analysis as in full syntactic parsing and semantic role labeling.</Paragraph>
  </Section>
  <Section position="8" start_page="596" end_page="597" type="metho">
    <SectionTitle>
4 Sequence Tagging
</SectionTitle>
    <Paragraph position="0"> We take a sequence labeling approach to learning a model for supersense tagging. Our goal is to learn a function from input vectors, the observations from labeled data, to response variables, the supersense labels. POS tagging, shallow parsing, NP-chunking and NER are all examples of sequence labeling tasks in which performance can besignificantlyimprovedbyoptimizingthechoice of labeling over whole sequences of words, rather than individual words. The limitations of the generative approach to sequence tagging, i. e. Hidden Markov Models, have been overcome by discriminative approaches proposed in recent years (Mc-Callum et al., 2000; Lafferty et al., 2001; Collins, 2002; Altun et al., 2003). In this paper we apply perceptron trained HMMs originally proposed in (Collins, 2002).</Paragraph>
    <Section position="1" start_page="596" end_page="597" type="sub_section">
      <SectionTitle>
4.1 Perceptron-trained HMM
</SectionTitle>
      <Paragraph position="0"> HMMs define a probabilistic model for observation/label sequences. The joint model of an observation/label sequence (x,y), is defined as:</Paragraph>
      <Paragraph position="2"> where yi is the ith label in the sequence and xi is the ith word. In the NLP literature, a common approach is to model the conditional distribution of label sequences given the label sequences. These models have several advantages over generative models, such as not requiring questionable independence assumptions, optimizing the conditional likelihood directly and employing richer feature representations. This task can be represented as learning a discriminant function F :XxY-IR, on a training data of observation/label sequences, where F is linear in a feature representation Ph defined over the joint input/output space</Paragraph>
      <Paragraph position="4"> Ph is a global feature representation, mapping each (x,y) pair to a vector of feature counts Ph(x,y)[?] IRd, where d is the total number of features. This vector is given by</Paragraph>
      <Paragraph position="6"> Each individual feature phi typically represents a morphological, contextual, or syntactic property, or also the inter-dependence of consecutive labels. These features are described in detail in Section 4.2. Given an observation sequence x, we make a prediction by maximizing F over the response variables:</Paragraph>
      <Paragraph position="8"> respect to the parameter vector w [?] IRd. The complexity of the Viterbi algorithm scales linearly with the length of the sequence.</Paragraph>
      <Paragraph position="9"> There are different ways of estimatingwfor the described model. We use the perceptron algorithm for sequence tagging (Collins, 2002). The perceptron algorithm focuses on minimizing the error rate, without involving any normalization factors.</Paragraph>
      <Paragraph position="10"> This property makes it very efficient which is a desirable feature in a task dealing with a large tagset such as ours. Additionally, the performance of perceptron-trained HMMs is very competitive on a number of tasks; e.g., in shallow parsing, where  Algorithm 1 Hidden Markov average perceptron algorithm.</Paragraph>
      <Paragraph position="11">  1: Initialize w0 =vector0 2: for t = 1....,T do 3: Choose xi 4: Compute ^y = argmaxy[?]Y F(xi,y;w) 5: if yinegationslash= ^y then 6: wt+1-wt + Ph(xi,yi)[?]Ph(xi, ^y) 7: end if 8: w = 1T summationtexttwt 9: end for 10: return w  the perceptron performance is comparable to that of Conditional Random Field models (Sha and Pereira, 2003), The tendency to overfit of the perceptron can be mitigated in a number of ways including regularization and voting. Here we apply averaging and straightforwardly extended Collins algorithm, summarized in Algorithm 1.</Paragraph>
    </Section>
    <Section position="2" start_page="597" end_page="597" type="sub_section">
      <SectionTitle>
4.2 Features
</SectionTitle>
      <Paragraph position="0"> We used the following combination of spelling/morphological and contextual features. For each observed word xi in the data ph extracts the following features:  1. Words: xi, xi[?]1, xi[?]2, xi+1, xi+2; 2. First sense: supersense baseline prediction for xi, fs(xi), cf. Section 5.3; 3. Combined (1) and (2): xi + fs(xi); 4. Pos: posi (the POS of xi), posi[?]1, posi[?]2,  posi+1, posi+2, posi[0], posi[?]1[0], posi[?]2[0], posi+1[0], posi+2[0], pos commi if xi's POS tagsis&amp;quot;NN&amp;quot;or&amp;quot;NNS&amp;quot;(commonnouns), and pos propi if xi's POS is &amp;quot;NNP&amp;quot; or &amp;quot;NNPS&amp;quot; (proper nouns); 5. Word shape: sh(xi), sh(xi[?]1), sh(xi[?]2), sh(xi+1), sh(xi+2), where sh(xi) is as described below. In addition shi = low if the first character of xi is lowercase, shi = cap brk if the first character of xi is uppercase and xi[?]1 is a full stop, question or exclamation mark, or xi is the first word of the sentence, shi = cap nobrk otherwise; 6. Previous label: supersense label yi[?]1.</Paragraph>
      <Paragraph position="1"> Word features (1) are morphologically simplified using the morphological functions of the Word-net library. The first sense feature (2) is the label predicted for xi by the baseline model, cf. Section 5.3. POS labels (4) were generated using Brants' TnT tagger (Brants, 2002). POS features of the form posi[0] extract the first character from the POS label, thus providing a simplified representation of the POS tag. Finally, word shape features (5) are regular expression-like transformation in which each character c of a string s is substituted with X if c is uppercase, if lowercase, c is substituted with x, if c is a digit it is substituted with d and left as it is otherwise. In addition each sequence of two or more identical characters c is substituted with c[?]. For example, for s = &amp;quot;Merrill Lynch&amp; Co.&amp;quot;, sh(s) = Xx[?]Xx[?]&amp;Xx..</Paragraph>
      <Paragraph position="2"> Exploratory experiments with richer feature sets, including syntactic information, affixes, and topic labels associated with words, did not result in improvements in terms of performance. While more experiments are needed to investigate the  usefulnessofothersourcesofinformation,thefeature set described above, while basic, offers good generalization properties.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="597" end_page="599" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="597" end_page="598" type="sub_section">
      <SectionTitle>
5.1 Data
</SectionTitle>
      <Paragraph position="0"> We experimented with the following data-sets3.</Paragraph>
      <Paragraph position="1"> The Semcor corpus (Miller et al., 1993), a fraction of the Brown corpus (KuVcera and Francis, 1967) which has been manually annotated with Wordnet synset labels. Named entities of the categories &amp;quot;person&amp;quot;, &amp;quot;location&amp;quot; and &amp;quot;group&amp;quot; are also annotated. The original annotation with Wordnet 1.6 synset IDs has been converted to the most recent version 2.0 of Wordnet. Semcor is divided in three parts: &amp;quot;brown1&amp;quot; and &amp;quot;brown2&amp;quot;, here referred to as &amp;quot;SEM&amp;quot;, in which nouns, verbs, adjectives and adverbs are annotated. In addition, the section &amp;quot;brownv&amp;quot;, &amp;quot;SEMv&amp;quot; here, contains annotations only for verbs. We also experimented with the Senseval-3 English all-words tasks data (Snyder and Palmer, 2004), here called &amp;quot;SE3&amp;quot;. The Senseval all-words task evaluates the performance of WSD systems on all open class words in complete documents. The Senseval-3 data consists of two Wall Street Journal Articles, &amp;quot;wsj 1778&amp;quot; and  senses&amp;quot; lists the number of instances of supersense labels, partitioned, in the following two rows, between verb and noun supersense labels. The lowest four rows summarize average polysemy figures at the synset and supersense level for both nouns and verbs.</Paragraph>
      <Paragraph position="2"> &amp;quot;wsj 1695&amp;quot;, and a fiction excerpt, &amp;quot;cl 23&amp;quot;, from the unannotated portion of the Brown corpus. Table 3 summarizes a fewstatistics about the composition of the datasets. The four lower rows report the average polysemy of nouns (&amp;quot;N&amp;quot;) and verbs (&amp;quot;V&amp;quot;), in each dataset, both at the synset level (&amp;quot;WS&amp;quot;) and supersense (&amp;quot;SS&amp;quot;) level. The average number of senses decreases significantly when the more general sense inventory is considered.</Paragraph>
      <Paragraph position="3"> We substituted the corresponding supersense to each noun and verb synset in all three data-sets: SEM, SEMv and SE3. All other tokens were labeled &amp;quot;0&amp;quot;. The supersense label &amp;quot;noun.Tops&amp;quot; refers to 45 synsets which lie at the very top of the Wordnet noun hierarchy. Some of these synsets are expressed by very general nouns such as &amp;quot;biont&amp;quot;, &amp;quot;benthos&amp;quot;, &amp;quot;whole&amp;quot;, and &amp;quot;nothing&amp;quot;. However, others undoubtedly refer to other supersenses, for which they provide the label, such as &amp;quot;food&amp;quot;, &amp;quot;person&amp;quot;, &amp;quot;plant&amp;quot; or &amp;quot;animal&amp;quot;. Since these nouns tend to be fairly frequent, it is confusing and inconsistent to label them &amp;quot;noun.Tops&amp;quot;; e.g., nouns such as &amp;quot;chowder&amp;quot; and &amp;quot;Swedish meatball&amp;quot; would be tagged as &amp;quot;noun.food&amp;quot;, but the noun &amp;quot;food&amp;quot; would be tagged as &amp;quot;noun.Tops&amp;quot;. For this reason, in all obvious cases, we substituted the &amp;quot;noun.Tops&amp;quot; label with the more specific supersense label for the noun4.</Paragraph>
      <Paragraph position="4"> The SEMv dataset only includes supersense labels for verbs. In order to avoid unwanted false negatives, that is, thousands of nouns labeled &amp;quot;0&amp;quot;, 4The nouns which are left with the &amp;quot;noun.Top&amp;quot; label are: entity, thing, anything, something, nothing, object, living thing, organism, benthos, heterotroph, life, and biont.</Paragraph>
      <Paragraph position="5"> we applied the following procedure. Rather than using the full sentences from the SEMv dataset,  fromeachsentencewegeneratedthefragmentsincluding a verb but no common or proper nouns; e.g., from a sentence such as &amp;quot;Karns' ruling pertainedverb.stative to eight of the 10 cases.&amp;quot; only the fragment &amp;quot;pertainedverb.stative to eight of the 10&amp;quot; is extracted and used for training.</Paragraph>
      <Paragraph position="6"> Sometimes more than one label is assigned to a word, in all data-sets. In these cases we adopted the heuristic of only using the first label in the data as the correct synset/supersense. We leave the extension of the tagger to the multilabel case for future research. As for now, we can expect that this solution will simply lower, somewhat, both the baseline and the tagger performance. Finally, we adopted a beginning (B) and continuation of entity (I) plus no label (0), encoding; i.e., the actual class space defines 83 labels.</Paragraph>
    </Section>
    <Section position="2" start_page="598" end_page="598" type="sub_section">
      <SectionTitle>
5.2 Setup
</SectionTitle>
      <Paragraph position="0"> The supersense tagger was trained on the Semcor datasets SEM and SEMv. The only free parameter to set in evaluation is the number of iterations to perform T (cf. Algorithm 1). We evaluated the model's accuracy on Semcor by splitting the SEM data randomly in training, development and evaluation. In a 5-fold cross-validation setup the tagger was trained on 4/5 of the SEM data, the remaining data was split in two halves, one used to fix T the other for evaluating performance on test. The full SEMv data was always added to the training portion of SEM. We also evaluated the model on the Senseval-3 data, using the same value for T set by cross-validation on the SEM data5. The ordering of the training instances is randomized across different runs, therefore the algorithm outputs different results after each run, even if the evaluation set is fixed, as is the case for the Senseval evaluation. The variance in the results on the SE3 data was measured in this way.</Paragraph>
    </Section>
    <Section position="3" start_page="598" end_page="599" type="sub_section">
      <SectionTitle>
5.3 Baseline tagger
</SectionTitle>
      <Paragraph position="0"> The first sense baseline is the supersense of the most frequent synset for a word, according to Wordnet's sense ranking. This baseline is very competitiveinWSDtasks, anditisextremelyhard to improve upon even slightly. In fact, the baseline has been proposed as a good alternative to WSD  computed on the five trials results.</Paragraph>
      <Paragraph position="1"> altogether (cf. (McCarthy et al., 2004)). For this reason we include the first sense prediction as one of the features of our tagging model.</Paragraph>
      <Paragraph position="2"> We apply the heuristic as follows. First, in each sentence, we identify the longest sequence which has an entry in Wordnet as either noun or verb. We carry out this step using the Wordnet's library functions, which perform also morphological simplification. Hence, in Example 1 the entry &amp;quot;stand up&amp;quot; is detected, although also &amp;quot;stand&amp;quot; has an entry in Wordnet. Then, each word identified in this way is assigned its most frequent sense - the only one available if the word is unambiguous. To reduce the number of candidate supersenses we distinguish between common and proper nouns; e.g. &amp;quot;Savannah&amp;quot; (city/river) is distinguished from &amp;quot;savannah&amp;quot; (grassland). This method improves slightly the accuracy of the baseline which does not distinguish between different types of nouns.</Paragraph>
    </Section>
  </Section>
</Paper>