<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1669">
  <Title>Two graph-based algorithms for state-of-the-art WSD</Title>
  <Section position="4" start_page="585" end_page="586" type="metho">
    <SectionTitle>
2 A graph algorithm for corpus-based
WSD
</SectionTitle>
    <Paragraph position="0"> The basic steps for our implementation of HyperLex and its variant using PageRank are common.</Paragraph>
    <Paragraph position="1"> We rst build the cooccurrence graph, then we select the hubs that are going to represent the senses using two different strategies inspired by HyperLex and PageRank. We are then ready to use the induced senses to do word sense disambiguation.</Paragraph>
    <Section position="1" start_page="585" end_page="585" type="sub_section">
      <SectionTitle>
2.1 Building cooccurrence graphs
</SectionTitle>
      <Paragraph position="0"> For each word to be disambiguated, a text corpus is collected, consisting of the paragraphs where the word occurs. From this corpus, a cooccurrence graph for the target word is built. Vertices in the graph correspond to words2 in the text (except the target word itself). Two words appearing in the same paragraph are said to cooccur, and are connected with edges. Each edge is assigned a weight which measures the relative frequency of the two words cooccurring. Speci cally, let wij be the weight of the edge3 connecting nodes i and j, then wij = 1 [?] max[P(i  |j), P(j  |i)], where</Paragraph>
      <Paragraph position="2"> The weight of an edge measures how tightly connected the two words are. Words which always occur together receive a weight of 0. Words rarely cooccurring receive weights close to 1.</Paragraph>
    </Section>
    <Section position="2" start_page="585" end_page="586" type="sub_section">
      <SectionTitle>
2.2 Selecting hubs: HyperLex vs. PageRank
</SectionTitle>
      <Paragraph position="0"> Once the cooccurrence graph is built, V*eronis proposes a simple iterative algorithm to obtain its hubs. At each step, the algorithm nds the vertex with highest relative frequency4 in the graph, and, if it meets some criteria, it is selected as a hub.</Paragraph>
      <Paragraph position="1"> These criteria are determined by a set of heuristic parameters, that will be explained later in Section 4. After a vertex is selected to be a hub, its neighbors are no longer eligible as hub candidates. At any time, if the next vertex candidate has a relative frequency below a certain threshold, the algorithm stops.</Paragraph>
      <Paragraph position="2"> Another alternative is to use the PageRank algorithm (Brin and Page, 1998) for nding hubs in the  and its degree are linearly related, and it is therefore possible to avoid the costly computation of the degree.</Paragraph>
      <Paragraph position="3">  coocurrence graph. PageRank is an iterative algorithm that ranks all the vertices according to their relative importance within the graph following a random-walk model. In this model, a link between vertices v1 and v2 means that v1 recommends v2.</Paragraph>
      <Paragraph position="4"> The more vertices recommend v2, the higher the rank of v2 will be. Furthermore, the rank of a vertex depends not only on how many vertices point to it, but on the rank of these vertices as well.</Paragraph>
      <Paragraph position="5"> Although PageRank was initially designed to work with directed graphs, and with no weights in links, the algorithm can be easily extended to model undirected graphs whose edges are weighted. Speci cally, let G = (V, E) be an undirected graph with the set of vertices V and set of edges E. For a given vertex vi, let In(vi) be the set of vertices pointing to it5. The rank of vi is de ned as:</Paragraph>
      <Paragraph position="7"> where wij is the weight of the link between vertices vi and vj, and 0 [?] d [?] 1. d is called the damping factor and models the probability of a web surfer standing at a vertex to follow a link from this vertex (probability d) or to jump to a random vertex in the graph (probability 1 [?] d). The factor is usually set at 0.85.</Paragraph>
      <Paragraph position="8"> The algorithm initializes the ranks of the vertices with a xed value (usually 1N for a graph with N vertices) and iterates until convergence below a given threshold is achieved, or, more typically, until a xed number of iterations are executed. Note that the convergence of the algorithms doesn't depend on the initial value of the ranks.</Paragraph>
      <Paragraph position="9"> After running the algorithm, the vertices of the graph are ordered in decreasing order according to its rank, and a number of them are chosen as the main hubs of the word. The hubs nally selected depend again of some heuristics and will be described in section 4.</Paragraph>
    </Section>
    <Section position="3" start_page="586" end_page="586" type="sub_section">
      <SectionTitle>
2.3 Using hubs for WSD
</SectionTitle>
      <Paragraph position="0"> Once the hubs that represent the senses of the word are selected (following any of the methods presented in the last section), each of them is linked to the target word with edges weighting 0, and the Minimum Spanning Tree (MST) of the whole graph is calculated and stored.</Paragraph>
      <Paragraph position="1"> 5As G is undirected, the in-degree of a vertex v is equal to its out-degree.</Paragraph>
      <Paragraph position="2"> The MST is then used to perform word sense disambiguation, in the following way. For every instance of the target word, the words surrounding it are examined and looked up in the MST. By construction of the MST, words in it are placed under exactly one hub. Each word in the context receives a set of scores s, with one score per hub, where all scores are 0 except the one corresponding to the hub where it is placed. If the scores are organized in a score vector, all values are 0, except, say, the i-th component, which receives a score d(hi, v), which is the distance between the hub hi and the node representing the word v. Thus, d(hi, v) assigns a score of 1 to hubs and the score decreases as the nodes move away from the hub in the tree.</Paragraph>
      <Paragraph position="3"> For a given occurrence of the target word, the score vectors of all the words in the context are added, and the hub that receives the maximum score is chosen.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="586" end_page="587" type="metho">
    <SectionTitle>
3 Evaluating unsupervised WSD systems
</SectionTitle>
    <Paragraph position="0"> All unsupervised WSD algorithms need some addition in order to be evaluated. One alternative, as in (V*eronis, 2004), is to manually decide the correctness of the hubs assigned to each occurrence of the words. This approach has two main disadvantages. First, it is expensive to manually verify each occurrence of the word, and different runs of the algorithm need to be evaluated in turn. Second, it is not an easy task to manually decide if an occurrence of a word effectively corresponds with the use of the word the assigned hub refers to, specially considering that the person is given a short list of words linked to the hub. Besides, it is widely acknowledged that people are leaned not to contradict the proposed answer.</Paragraph>
    <Paragraph position="1"> A second alternative is to evaluate the system according to some performance in an application, e.g. information retrieval (Schcurrency1utze, 1998). This is a very attractive idea, but requires expensive system development and it is sometimes dif cult to separate the reasons for the good (or bad) performance. null A third alternative would be to devise a method to map the hubs (clusters) returned by the system to the senses in a lexicon. Pantel and Lin (2002) automatically mapped the senses to WordNet, and then measured the quality of the mapping. More recently, tagged corpora have been used to map the induced senses, and then compare the systems over publicly available benchmarks (Puran null dare and Pedersen, 2004; Niu et al., 2005; Agirre et al., 2006), which offers the advantage of comparing to other systems, but converts the whole system into semi-supervised. See Section 5 for more details on these systems. Note that the mapping introduces noise and information loss, which is a disadvantage when comparing to other systems that rely on the gold-standard senses.</Paragraph>
    <Paragraph position="2"> Yet another possibility is to evaluate the induced senses against a gold standard as a clustering task.</Paragraph>
    <Paragraph position="3"> Induced senses are clusters, gold standard senses are classes, and measures from the clustering literature like entropy or purity can be used. In this case the manually tagged corpus is taken to be the gold standard, where a class is the set of examples tagged with a sense.</Paragraph>
    <Paragraph position="4"> We decided to adopt the last two alternatives, since they allow for comparison over publicly available systems of any kind.</Paragraph>
    <Section position="1" start_page="587" end_page="587" type="sub_section">
      <SectionTitle>
3.1 Evaluation of clustering: hubs as clusters
</SectionTitle>
      <Paragraph position="0"> In this setting the selected hubs are treated as clusters of examples and gold standard senses are classes. In order to compare the clusters with the classes, hand annotated corpora are needed (for instance Senseval). The test set is rst tagged with the induced senses. A perfect clustering solution will be the one where each cluster has exactly the same examples as one of the classes, and vice versa. The evaluation is completely unsupervised.</Paragraph>
      <Paragraph position="1"> Following standard cluster evaluation practice (Zhao and Karypis, 2005), we consider three measures: entropy, purity and Fscore. The entropy measure considers how the various classes of objects are distributed within each cluster. In general, the smaller the entropy value, the better the clustering algorithm performs. The purity measure considers the extent to which each cluster contained objects from primarily one class. The larger the values of purity, the better the clustering algorithm performs. The Fscore is used in a similar fashion to Information Retrieval exercises, with precision and recall de ned as the percentage of correctly retrieved examples for a cluster (divided by total cluster size), and recall as the percentage of correctly retrieved examples for a cluster (divided by total class size). For a formal de nition refer to (Zhao and Karypis, 2005). If the clustering is identical to the original classes in the datasets, FScore will be equal to one which means that the higher the FScore, the better the clustering is.</Paragraph>
    </Section>
    <Section position="2" start_page="587" end_page="587" type="sub_section">
      <SectionTitle>
3.2 Evaluation as supervised WSD: mapping hubs to senses
</SectionTitle>
      <Paragraph position="0"> hubs to senses (Agirre et al., 2006) presents a straightforward framework that uses hand-tagged material in order to map the induced senses into the senses used in a gold standard . The WSD system rst tags the training part of some hand-annotated corpus with the induced hubs. The hand labels are then used to construct a matrix relating assigned hubs to existing senses, simply counting the times an occurrence with sense sj has been assigned hub hi. In the testing step we apply the WSD algorithm over the test corpus, using the hubs-to-senses matrix to select the sense with highest weights. See (Agirre et al., 2006) for further details.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="587" end_page="587" type="metho">
    <SectionTitle>
4 Tuning the parameters
</SectionTitle>
    <Paragraph position="0"> The behavior of the original HyperLex algorithm was in uenced by a set of heuristic parameters, which were set by V*eronis following his intuition.</Paragraph>
    <Paragraph position="1"> In (Agirre et al., 2006) we tuned the parameters using the mapping strategy for evaluation. We set a range for each of the parameters, and evaluated the algorithm for each combination of the parameters on a xed set of words (S2LS), which was different from the nal test sets (S3LS and S3AW). This ensures that the chosen parameter set can be used for any noun, and is not over tted to a small set of nouns.</Paragraph>
    <Paragraph position="2"> In this paper, we perform the parameter tuning according to four different criteria, i.e., best supervised performance and best unsupervised entropy/purity/FScore performance. At the end, we have four sets of parameters (those that obtained the best results in S2LS for each criterion), and each set is then selected to be run against the S3LS and S3AW datasets.</Paragraph>
    <Paragraph position="3"> The parameters of the graph-based algorithm can be divided in two sets: those that affect how the cooccurrence graph is built (p1 p4 below), and those that control the way the hubs are extracted  Both strategies to select hubs from the coocurrence graph (cf. Sect. 2.2) share parameters p1 p4. The algorithm proposed by V*eronis uses p5 p6 as requirements for hubs, and p7 as the threshold to stop looking for more hubs: candidates with frequency below p7 are not eligible to be hubs.</Paragraph>
    <Paragraph position="4"> Regarding PageRank the original formulation does not have any provision for determining which are hubs and which not, it just returns a weighted list of vertices. We have experimented with two methods: a threshold for the frequency of the hubs (as before, p7), and a xed number of hubs for every target word (p8). For a shorthand we use Vr for Veronis' original formulation with default parameters, Vr opt for optimized parameters, and Pr fr and Pr fx respectively for the two ways of using PageRank.</Paragraph>
    <Paragraph position="5"> Table 1 lists the parameters of the HyperLex algorithm, with the default values proposed for them in the original work (second column), the ranges that we explored, and the optimal values according to the supervised recall evaluation (cf. Sect. 3.1). For Vr opt we tried 6700 combinations. PageRank has less parameters, and we also used the previous optimization of Vr opt to limit the range of p4, so Pr fr and Pr fx get respectively 180 and 288 combinations. null</Paragraph>
  </Section>
  <Section position="7" start_page="587" end_page="591" type="metho">
    <SectionTitle>
5 Experiment setting and results
</SectionTitle>
    <Paragraph position="0"> To evaluate the HyperLex algorithm in a standard benchmark, we will rst focus on a more extensive evaluation of S3LS and then see the results in S3AW (cf. Sec. 5.4). Following the design for evaluation explained in Section 3, we use the standard train-test split for the supervised evaluation, while the unsupervised evaluation only uses the test part.</Paragraph>
    <Paragraph position="1"> Table 2 shows the results of the 4 variants of our algorithm. Vr stands for the original Veronis algorithm with default parameters, Vr opt to our optimized version, and Pr fr and Pr fx to the Sup. Unsupervised Rec. Entr. Pur. FS  recall, Entropy, Purity and Fscore) and evaluated on S3LS according to the respective evaluation criteria (in the columns). Two baselines, plus 3 supervised and 5 unsupervised systems are also shown. Bold is used for best results in each category. two variants of PageRank. In the columns we nd the evaluation results according to our 4 criteria. For supervised evaluation we indicate only recall, which in our case equals precision, as the coverage is 100% in all cases (values returned by the of cial Senseval scorer). We also include 2 baselines, a system returning a single cluster (that of the most frequent sense, MFS), and another returning one cluster for each example (1ex-1hub). The last rows list the results for 3 supervised and 5 unsupervised systems (see Sect. 5.1). We will comment on the result of this table from different perspectives.</Paragraph>
    <Section position="1" start_page="587" end_page="589" type="sub_section">
      <SectionTitle>
5.1 Supervised evaluation
</SectionTitle>
      <Paragraph position="0"> In this subsection we will focus in the rst four evaluation rows in Table 2. All variants of the algorithm outperform by an ample margin the MFS and the 1ex-1hub baselines when evaluated on S3LS recall. This means that the method is able to learn useful hubs. Note that we perform this supervised evaluation just for comparison with other systems, and to prove that we are able to provide high performance WSD.</Paragraph>
      <Paragraph position="1"> The default parameter setting (Vr) gets the worst results, followed by the xed-hub implementation of PageRank (Pr fx). Pagerank with frequency threshold (Pr fr) and the optimized Veronis (Vr opt) obtain a 10 point improvement over the MFS baseline with very similar results (the difference is not statistically signi cant according to McNemar's test at 95% con dence  level).</Paragraph>
      <Paragraph position="2"> Table 2 also shows the results of three supervised systems. These results (and those of the other unsupervised systems in the table) where obtained from the Senseval website, and the only processing we did was to lter nouns. S3LS-best stands for the the winner of S3LS (Mihalcea et al., 2004), which is 8.3 points over our method. We also include the results of two of our in-house systems. kNN-all is a state-of-the-art system (Agirre et al., 2005) using wide range of local and topical features, and only 2.3 points below the best S3LS system. kNN-BoW which is the same supervised system, but restricted to bag-of-words features only, which are the ones used by our graph-based systems. The table shows that Vr opt and Pr fr are one single point from kNN-BoW, which is an impressive result if we take into account the information loss of the mapping step and that we tuned our parameters on a different set of words.</Paragraph>
      <Paragraph position="3"> The last 5 rows of Table 2 show several unsupervised systems, all of which except Cymfony (Niu et al., 2005) and (Purandare and Pedersen, 2004) participated in S3LS (check (Mihalcea et al., 2004) for further details on the systems). We classify them according to the amount of supervision they have: some have access to most-frequent information (MFS-S3 if counted over S3LS, MFS-Sc if counted over SemCor), some use 10% of the S3LS training part for mapping (10%-S3LS). Only one system (Duluth) did not use in any way hand-tagged corpora.</Paragraph>
      <Paragraph position="4"> The table shows that Vr opt and Pr fr are more than 6 points above the other unsupervised systems, but given the different typology of unsupervised systems, it's unfair to draw de nitive conclusions from a raw comparison of results. The system coming closer to ours is that described in (Niu et al., 2005). They use hand tagged corpora which does not need to include the target word to tune the parameters of a rather complex clustering method which does use local features. They do use the S3LS training corpus for mapping. For every sense of the target word, three of its contexts in the train corpus are gathered (around 10% of the training data) and tagged. Each cluster is then related with its most frequent sense. The mapping method is similar to ours, but we use all the available training data and allow for different hubs to be assigned to the same sense.</Paragraph>
      <Paragraph position="5"> Another system similar to ours is (Purandare and Pedersen, 2004), which unfortunately was evaluated on Senseval 2 data and is not included in the table. The authors use rst and second order bag-of-word context features to represent each instance of the corpus. They apply several clustering algorithms based on the vector space model, limiting the number of clusters to 7. They also use all available training data for mapping, but given their small number of clusters they opt for a one-to-one mapping which maximizes the assignment and discards the less frequent clusters. They also discard some dif cult cases, like senses and words with low frequencies (10% of total occurrences and 90, respectively). The different test set and mapping system make the comparison dif cult, but the fact that the best of their combinations beats MFS by 1 point on average (47.6% vs. 46.4%) for the selected nouns and senses make us think that our results are more robust (nearly 10% over MFS).</Paragraph>
    </Section>
    <Section position="2" start_page="589" end_page="589" type="sub_section">
      <SectionTitle>
5.2 Clustering evaluation
</SectionTitle>
      <Paragraph position="0"> The three columns corresponding to fully unsupervised evaluation in Table 2 show that all our 3 optimized variants easily outperform the MFS baseline. The best results are in this case for the optimized Veronis, followed closely by Pagerank with frequency threshold.</Paragraph>
      <Paragraph position="1"> The comparison with the supervised and unsupervised systems shows that our system gets better entropy and purity values, but worse FScore. This can be explained by the bias of entropy and purity towards smaller and more numerous clusters. In fact the 1ex-1hub baseline obtains the best entropy and purity scores. Our graph-based system tends to induce a large number of senses (with averages of 60 to 70 senses). On the other hand FScore penalizes the systems inducing a different number of clusters. As the supervised and unsupervised systems were designed to return the same (or similar) number of senses as in the gold standard, they attain higher FScores. This motivated us to compare the results of the best parameters across evaluation methods.</Paragraph>
    </Section>
    <Section position="3" start_page="589" end_page="590" type="sub_section">
      <SectionTitle>
5.3 Comparison across evaluation methods
</SectionTitle>
      <Paragraph position="0"> Table 3 shows all 16 evaluation possibilities for each variant of the algorithm, depending of the evaluation criteria used in S2LS (in the rows) and the evaluation criteria used in S3LS (in the columns). This table shows that the best results (in bold for each variant) tend to be in the diagonal,  that is, when the same evaluation criterion is used for optimization and test, but it is not decisive. If we take the rst row (supervised evaluation) as the most credible criterion, we can see that optimizing according to entropy and purity get similar and sometimes better result (Pr fr and Pr fx). On the contrary the Fscore yields worse results by far.</Paragraph>
      <Paragraph position="1"> This indicates that a purely unsupervised system evaluated according to the gold standard (based on entropy or purity) yields optimal parameters similar to the supervised (mapped) version. This is an important result, as it shows that the quality in performance does not come from the mapping step, but from the algorithm and optimal parameter setting. The table shows that optimization on purity and entropy criteria do correlate with good performance in the supervised evaluation. null The failure of FScore based optimization, in our opinion, indicates that our clustering algorithm prefers smaller and more numerous clusters, compared to the gold standard. FScore prefers clustering solutions that have a similar number of clusters to that of the gold standard, but it is unable to drive the optimization or our algorithm towards good results in the supervised evaluation.</Paragraph>
      <Paragraph position="2"> All in all, the best results are attained with smaller and more numerous hubs, a kind of microsenses. This effect is the same for all three variants tried and all evaluation criteria, with Fscore yielding less clusters. At rst we were uncomfortable with this behavior, so we checked whether HyperLex was degenerating into a trivial solution.</Paragraph>
      <Paragraph position="3"> This was the main reason to include the 1ex-1hub baseline, which simulates a clustering algorithm returning one hub per example, and its precision was 40.1, well below the MFS baseline. We also realized that our results are in accordance with some theories of word meaning, e.g. the indefinitely large set of prototypes-within-prototypes envisioned in (Cruse, 2000). Ted Pedersen has also observed a similar behaviour in his vector-space model clustering experiments (PC). We now think that the idea of having many micro-senses is attractive for further exploration, specially if we are able to organize them into coarser hubs in future work.</Paragraph>
    </Section>
    <Section position="4" start_page="590" end_page="591" type="sub_section">
      <SectionTitle>
5.4 S3AW task
</SectionTitle>
      <Paragraph position="0"> In the Senseval-3 all-words task (Snyder and Palmer, 2004) all words in three document ex-Sup. Unsupervised Alg. Opt. Rec. Entr. Pur. FS  most frequent baseline and the top three supervised systems cerpts need to be disambiguated. Given the scarce amount of training data available in Semcor (Miller et al., 1993), supervised systems barely improve upon the simple most frequent heuristic. In this setting the unsupervised evaluation schemes are not feasible, as many of the target words occur only once, so we used the mapping strategy with Semcor to produce the required WordNet senses in the output.</Paragraph>
      <Paragraph position="1"> Table 4 shows the results for our systems with the best parameters according to the supervised criterion on S2LS, plus the top three S3AW supervised systems and the most frequent sense heuristic. In order to focus the comparison, we only kept noun occurrences of all systems and ltered out multiwords, target words with two different lemmas and unknown tags, leaving a total of 857 occurrences of nouns. We can see that Pr fr is only 0.2 from the S3AW winning system, demonstrating that our unsupervised graph-based systems that use Semcor for mapping are nearly equivalent to the most powerful supervised systems to date.</Paragraph>
      <Paragraph position="2"> In fact, the differences in performance for the systems are not statistically signi cant (McNemar's test at 95% signi cance level).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>