<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1013">
  <Title>Ensemble Methods for Unsupervised WSD</Title>
  <Section position="4" start_page="97" end_page="99" type="metho">
    <SectionTitle>
2 The Disambiguation Algorithms
</SectionTitle>
    <Paragraph position="0"> In this section we brie y describe the unsupervised WSD algorithms used in our experiments.</Paragraph>
    <Paragraph position="1"> We selected methods that vary along the following dimensions: (a) the type of WSD performed (i.e., token-based vs. type-based), (b) the representation and size of the context surrounding an ambiguous word (i.e., graph-based vs. word-based, document vs. sentence), and (c) the number and type of semantic relations considered for disambiguation. We base most of our discussion below on the WordNet sense inventory; however, the approaches are not limited to this particular lexicon but could be adapted for other resources with traditional dictionary-like sense de nitions and alternative structure.</Paragraph>
    <Paragraph position="2"> Extended Gloss Overlap Gloss Overlap was originally introduced by Lesk (1986) for performing token-based WSD. The method assigns a sense to a target word by comparing the dictionary de nitions of each of its senses with those of the words in the surrounding context. The sense whose de nition has the highest overlap (i.e., words in common) with the context words is assumed to be the correct one. Banerjee and Pedersen (2003) augment the dictionary de nition (gloss) of each sense with the glosses of related words and senses. The extended glosses increase the information available in estimating the overlap between ambiguous words and their surrounding context.</Paragraph>
    <Paragraph position="3"> The range of relationships used to extend the glosses is a parameter, and can be chosen from any combination of WordNet relations. For every sense sk of the target word we estimate:</Paragraph>
    <Paragraph position="5"> where context is a simple (space separated) concatenation of all words wi for n i n,i 6= 0 in a context window of length n around the target word w0. The overlap scoring mechanism is also parametrized and can be adjusted to take the into account gloss length or to ignore function words.</Paragraph>
    <Paragraph position="6"> Distributional and WordNet Similarity McCarthy et al. (2004) propose a method for automatically ranking the senses of ambiguous words from raw text. Key in their approach is the observation that distributionally similar neighbors often provide cues about a word's senses. Assuming that a set of neighbors is available, sense ranking is equivalent to quantifying the degree of similarity among the neighbors and the sense descriptions of the polysemous word.</Paragraph>
    <Paragraph position="7"> Let N(w) = fn1,n2,...,nkg be the k most (distributionally) similar words to an ambiguous target word w and senses(w) = fs1,s2,...sng the set of senses for w. For each sense si and for each neighbor n j, the algorithm selects the neighbor's sense which has the highest WordNet similarity score (wnss) with regard to si. The ranking score of sense si is then increased as a function of the WordNet similarity score and the distributional similarity score (dss) between the target word and the neighbor:</Paragraph>
    <Paragraph position="9"> The predominant sense is simply the sense with the highest ranking score (RankScore) and can be consequently used to perform type-based disambiguation. The method presented above has four parameters: (a) the semantic space model representing the distributional properties of the target words (it is acquired from a large corpus representative of the domain at hand and can be augmented with syntactic relations such as subject or object), (b) the measure of distributional similarity for discovering neighbors (c) the number of neighbors that the ranking score takes into account, and (d) the measure of sense similarity.</Paragraph>
    <Paragraph position="10"> Lexical Chains Lexical cohesion is often represented via lexical chains, i.e., sequences of related words spanning a topical text unit (Morris and Hirst, 1991). Algorithms for computing lexical chains often perform WSD before inferring which words are semantically related. Here we describe one such disambiguation algorithm, proposed by Galley and McKeown (2003), while omitting the details of creating the lexical chains themselves.</Paragraph>
    <Paragraph position="11"> Galley and McKeown's (2003) method consists of two stages. First, a graph is built representing all possible interpretations of the target words  in question. The text is processed sequentially, comparing each word against all words previously read. If a relation exists between the senses of the current word and any possible sense of a previous word, a connection is formed between the appropriate words and senses. The strength of the connection is a function of the type of relationship and of the distance between the words in the text (in terms of words, sentences and paragraphs). Words are represented as nodes in the graph and semantic relations as weighted edges. Again, the set of relations being considered is a parameter that can be tuned experimentally.</Paragraph>
    <Paragraph position="12"> In the disambiguation stage, all occurrences of a given word are collected together. For each sense of a target word, the strength of all connections involving that sense are summed, giving that sense a uni ed score. The sense with the highest uni ed score is chosen as the correct sense for the target word. In subsequent stages the actual connections comprising the winning uni ed score are used as a basis for computing the lexical chains.</Paragraph>
    <Paragraph position="13"> The algorithm is based on the one sense per discourse hypothesis and uses information from every occurrence of the ambiguous target word in order to decide its appropriate sense. It is therefore a type-based algorithm, since it tries to determine the sense of the word in the entire document/discourse at once, and not separately for each instance.</Paragraph>
    <Paragraph position="14">  spired by lexical chains, Navigli and Velardi (2005) developed Structural Semantic Interconnections (SSI), a WSD algorithm which makes use of an extensive lexical knowledge base. The latter is primarily based on WordNet and its standard relation set (i.e., hypernymy, meronymy, antonymy, similarity, nominalization, pertainymy) but is also enriched with collocation information representing semantic relatedness between sense pairs. Collocations are gathered from existing resources (such as the Oxford Collocations, the Longman Language Activator, and collocation web sites).</Paragraph>
    <Paragraph position="15"> Each collocation is mapped to the WordNet sense inventory in a semi-automatic manner (Navigli, 2005) and transformed into a relatedness edge.</Paragraph>
    <Paragraph position="16"> Given a local word context C = fw1,...,wng, SSI builds a graph G = (V,E) such that V =</Paragraph>
    <Paragraph position="18"> senses(wi) and (s,sprime) 2 E if there is at least one interconnection j between s (a sense of the word) and sprime (a sense of its context) in the lexical knowledge base. The set of valid interconnections is determined by a manually-created context-free  grammar consisting of a small number of rules.</Paragraph>
    <Paragraph position="19"> Valid interconnections are computed in advance on the lexical database, not at runtime.</Paragraph>
    <Paragraph position="20"> Disambiguation is performed in an iterative fashion. At each step, for each sense s of a word in C (the set of senses of words yet to be disambiguated), SSI determines the degree of connectivity between s and the other senses in C:</Paragraph>
    <Paragraph position="22"> where Interconn(s,sprime) is the set of interconnections between senses s and sprime. The contribution of a single interconnection is given by the reciprocal of its length, calculated as the number of edges connecting its ends. The overall degree of connectivity is then normalized by the number of contributing interconnections. The highest ranking sense s of word wi is chosen and the senses of wi are removed from the context C. The procedure terminates when either C is the empty set or there is no sense such that its SSIScore exceeds a xed threshold. null Summary The properties of the different WSD algorithms just described are summarized in Table 1. The methods vary in the amount of data they employ for disambiguation. SSI and Extended Gloss Overlap (Overlap) rely on sentence-level information for disambiguation whereas Mc-Carthy et al. (2004) (Similarity) and Galley and McKeown (2003) (LexChains) utilize the entire document or corpus. This enables the accumulation of large amounts of data regarding the ambiguous word, but does not allow separate consideration of each individual occurrence of that word. LexChains and Overlap take into account a restricted set of semantic relations (paths of length one) between any two words in the whole document, whereas SSI and Similarity use a wider set of relations.</Paragraph>
  </Section>
  <Section position="5" start_page="99" end_page="100" type="metho">
    <SectionTitle>
3 Experiment 1: Comparison of Unsupervised Algorithms for WSD
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
3.1 Method
</SectionTitle>
      <Paragraph position="0"> We evaluated the disambiguation algorithms outlined above on two tasks: predominant sense acquisition and token-based WSD. As previously explained, Overlap and SSI were not designed for acquiring predominant senses (see Table 1), but a token-based WSD algorithm can be trivially modi ed to acquire predominant senses by disambiguating every occurrence of the target word in context and selecting the sense which was chosen most frequently. Type-based WSD algorithms simply tag all occurrences of a target word with its predominant sense, disregarding the surrounding context.</Paragraph>
      <Paragraph position="1"> Our rst set of experiments was conducted on the SemCor corpus, on the same 2,595 polysemous nouns (53,674 tokens) used as a test set by McCarthy et al. (2004). These nouns were attested in SemCor with a frequency &gt; 2 and occurred in the British National Corpus (BNC) more than 10 times. We used the WordNet 1.7.1 sense inventory.</Paragraph>
      <Paragraph position="2"> The following notation describes our evaluation measures: W is the set of all noun types in the SemCor corpus (jWj = 2,595), and Wf is the set of noun types with a dominant sense. senses(w) is the set of senses for noun type w, while fs(w) and fm(w) refer to w's rst sense according to the SemCor gold standard and our algorithms, respectively. Finally, T(w) is the set of tokens of w and senses(t) denotes the sense assigned to token t according to SemCor.</Paragraph>
      <Paragraph position="3"> We rst measure how well our algorithms can identify the predominant sense, if one exists: Accps = jfw 2Wf j fs(w) = fm(w)gjjW fj A baseline for this task can be easily de ned for each word type by selecting a sense at random from its sense inventory and assuming that this is the predominant sense:</Paragraph>
      <Paragraph position="5"> We evaluate the algorithms' disambiguation performance by measuring the ratio of tokens for which our models choose the right sense:</Paragraph>
      <Paragraph position="7"> In the predominant sense detection task, in case of ties in SemCor, any one of the predominant senses was considered correct. Also, all algorithms were designed to randomly choose from among the top scoring options in case of a tie in the calculated scores. This introduces a small amount of randomness (less than 0.5%) in the accuracy calculation, and was done to avoid the pitfall of defaulting to the rst sense listed in WordNet, which is usually the actual predominant sense (the order of senses in WordNet is based primarily on the SemCor sense distribution).</Paragraph>
    </Section>
    <Section position="2" start_page="99" end_page="100" type="sub_section">
      <SectionTitle>
3.2 Parameter Settings
</SectionTitle>
      <Paragraph position="0"> We did not speci cally tune the parameters of our WSD algorithms on the SemCor corpus, as our goal was to use hand labeled data solely for testing purposes. We selected parameters that have been considered optimal in the literature, although admittedly some performance gains could be expected had parameter optimization taken place.</Paragraph>
      <Paragraph position="1"> For Overlap, we used the semantic relations proposed by Banerjee and Pedersen (2003), namely hypernyms, hyponyms, meronyms, holonyms, and troponym synsets. We also adopted their overlap scoring mechanism which treats each gloss as a bag of words and assigns an n word overlap the score of n2. Function words were not considered in the overlap computation.</Paragraph>
      <Paragraph position="2"> For LexChains, we used the relations reported in Galley and McKeown (2003). These are all rst-order WordNet relations, with the addition of the siblings two words are considered siblings if they are both hyponyms of the same hypernym.</Paragraph>
      <Paragraph position="3"> The relations have different weights, depending on their type and the distance between the words in the text. These weights were imported from Galley and McKeown into our implementation without modi cation.</Paragraph>
      <Paragraph position="4"> Because the SemCor corpus is relatively small (less than 700,00 words), it is not ideal for constructing a neighbor thesaurus appropriate for Mc-Carthy et al.'s (2004) method. The latter requires each word to participate in a large number of co-occurring contexts in order to obtain reliable distributional information. To overcome this problem, we followed McCarthy et al. and extracted the neighbor thesaurus from the entire BNC. We also recreated their semantic space, using a RASPparsed (Briscoe and Carroll, 2002) version of the BNC and their set of dependencies (i.e., Verb-Object, Verb-Subject, Noun-Noun and Adjective-Noun relations). Similarly to McCarthy et al., we used Lin's (1998) measure of distributional similarity, and considered only the 50 highest ranked  gorithms on SemCor nouns2 ([?]: sig. diff. from Baseline, : sig. diff. from Similarity, $: sig diff. from SSI, #: sig. diff. from Overlap, p &lt; 0.01) neighbors for a given target word. Sense similarity was computed using the Lesk's (Banerjee and Pedersen, 2003) similarity measure1.</Paragraph>
    </Section>
    <Section position="3" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
3.3 Results
</SectionTitle>
      <Paragraph position="0"> The performance of the individual algorithms is shown in Table 2. We also include the baseline discussed in Section 3 and the upper bound of defaulting to the rst (i.e., most frequent) sense provided by the manually annotated SemCor. We report predominant sense accuracy (Accps), and WSD accuracy when using the automatically acquired predominant sense (Accwsd=ps). For token-based algorithms, we also report their WSD performance in context, i.e., without use of the predominant sense (Accwsd=dir).</Paragraph>
      <Paragraph position="1"> As expected, the accuracy scores in the WSD task are lower than the respective scores in the predominant sense task, since detecting the predominant sense correctly only insures the correct tagging of the instances of the word with that rst sense. All methods perform signi cantly better than the baseline in the predominant sense detection task (using a kh2-test, as indicated in Table 2). LexChains and Overlap perform significantly worse than Similarity and SSI, whereas LexChains is not signi cantly different from Overlap. Likewise, the difference in performance between SSI and Similarity is not signi cant. With respect to WSD, all the differences in performance are statistically signi cant.</Paragraph>
      <Paragraph position="2">  (2003), since they tested on a subset of SemCor, and included monosemous nouns. They also used the rst sense in SemCor in case of ties. The results for the Similarity method are slightly better than those reported by McCarthy et al. (2004) due to minor improvements in implementation.</Paragraph>
      <Paragraph position="3">  tecting the predominant sense (as % of all words) Interestingly, using the predominant sense detected by the Gloss Overlap and the SSI algorithm to tag all instances is preferable to tagging each instance individually (compare Accwsd=dir and Accwsd=ps for Overlap and SSI in Table 2).</Paragraph>
      <Paragraph position="4"> This means that a large part of the instances which were not tagged individually with the predominant sense were actually that sense.</Paragraph>
      <Paragraph position="5"> A close examination of the performance of the individual methods in the predominant-sense detection task shows that while the accuracy of all the methods is within a range of 7%, the actual words for which each algorithm gives the correct predominant sense are very different. Table 3 shows the degree of overlap in assigning the appropriate predominant sense among the four methods. As can be seen, the largest amount of overlap is between Similarity and SSI, and this corresponds approximately to 23 of the words they correctly label. This means that each of these two methods gets more than 350 words right which the other labels incorrectly.</Paragraph>
      <Paragraph position="6"> If we had an oracle which would tell us which method to choose for each word, we would achieve approximately 82.4% in the predominant sense task, giving us 58% in the WSD task. We see that there is a large amount of complementation between the algorithms, where the successes of one make up for the failures of the others. This suggests that the errors of the individual methods are suf ciently uncorrelated, and that some advantage can be gained by combining their predictions.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="100" end_page="101" type="metho">
    <SectionTitle>
4 Combination Methods
</SectionTitle>
    <Paragraph position="0"> An important nding in machine learning is that a set of classi ers whose individual decisions are combined in some way (an ensemble) can be more accurate than any of its component classi ers, provided that the individual components are relatively accurate and diverse (Dietterich, 1997). This simple idea has been applied to a variety of classication problems ranging from optical character recognition to medical diagnosis, part-of-speech tagging (see Dietterich 1997 and van Halteren et al. 2001 for overviews), and notably supervised  WSD (Florian et al., 2002).</Paragraph>
    <Paragraph position="1"> Since our effort is focused exclusively on unsupervised methods, we cannot use most machine learning approaches for creating an ensemble (e.g., stacking, con dence-based combination), as they require a labeled training set. We therefore examined several basic ensemble combination approaches that do not require parameter estimation from training data.</Paragraph>
    <Paragraph position="2"> We de ne Score(Mi,s j) as the (normalized) score which a method Mi gives to word sense s j.</Paragraph>
    <Paragraph position="3"> The predominant sense calculated by method Mi for word w is then determined by:</Paragraph>
    <Paragraph position="5"> All ensemble methods receive a set fMigki=1 of individual methods to combine, so we denote each ensemble method by MethodName(fMigki=1).</Paragraph>
    <Paragraph position="6"> Direct Voting Each ensemble component has one vote for the predominant sense, and the sense with the most votes is chosen. The scoring function for the voting ensemble is de ned as:</Paragraph>
    <Paragraph position="8"> where eq[s,PS(Mi,w)] = braceleftbigg 1 if s = PS(M  provides a ranking of the senses for a given target word. For each sense, its placements according to each of the methods are summed and the sense with the lowest total placement (closest to rst place) wins.</Paragraph>
    <Paragraph position="10"> where Placei(s) is the number of distinct scores that are larger or equal to Score(Mi,s).</Paragraph>
    <Paragraph position="11"> Arbiter-based Combination One WSD method can act as an arbiter for adjudicating disagreements among component systems. It makes sense for the adjudicator to have reasonable performance on its own. We therefore selected  diff. from Similarity, $: sig. diff. from SSI, : sig. diff. from Voting, p &lt; 0.01) SSI as the arbiter since it had the best accuracy on the WSD task (see Table 2). For each disagreed word w, and for each sense s of w assigned by any of the systems in the ensemble fMigki=1, we calculate the following score:</Paragraph>
    <Paragraph position="13"> where SSIScore[?](s) is a modi ed version of the score introduced in Section 2 which exploits as a context for s the set of agreed senses and the remaining words of each sentence. We exclude from the context used by SSI the senses of w which were not chosen by any of the systems in the ensemble . This effectively reduces the number of senses considered by the arbiter and can positively in uence the algorithm's performance, since it eliminates noise coming from senses which are likely to be wrong.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML