<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0321">
  <Title>Word Sense Disambiguation Based on Structured Semantic Space*</Title>
  <Section position="3" start_page="187" end_page="188" type="metho">
    <SectionTitle>
2 Semantic Space
</SectionTitle>
    <Paragraph position="0"> In general, a word may have several senses and may appear in several different kinds of contexts. From a point of empirical view, we suppose that each sense of a word is corresponded with a particular kind of context it appears, and the similarity between word senses can be measured by their corresponding contexts. For a particular kind of language, we regard its semantic space as the set of all word senses of the language, with similarity relation between them.</Paragraph>
    <Paragraph position="1"> Now that word senses are in accordance with their contexts, we use the contexts to model word senses. Due to the unavailability of large~-scale i Because the senses in the semantic space are of mono-sense words, we don't distinguish &amp;quot;words&amp;quot; from &amp;quot;senses&amp;quot; strictly here.</Paragraph>
    <Paragraph position="2">  sense-tagged corpus, we try to outline the semantic space by only taking into consideration the mona-sense words, instead of locating all word senses in the space.</Paragraph>
    <Paragraph position="3"> In order to formally represent word senses, we formalize the notion of context as multidimensional real-valued vectors. For any word, we first annotate its neighboring words within certain distances in the corpus with all of their semantic codes in a thesaurus respectively, then make use of such codes and their salience with respect to the word to formalize its contexts. Suppose w is a mona-sense word, and there are n occurrences of the word in a corpus, i.e., wt, we .... , w,, (1) lists their neighbor words within d word distances respectively.</Paragraph>
    <Paragraph position="4"> (l) al,.d, al,-(d.l), ..., al,-i az-a, az-ca-O .... , az.t an,~, a,,-(d-O ..... an, a WI at, l, al,2, ..., al,d we azt, az2 ..... aza Wn an, l, an, e, ..., an, d Suppose Cr is the set of all the semantic codes defined in a thesaurus, for any occurrence wt, 1_&lt; i.~_n, let NCi. be the set of all the semantic codes of its neighboring words which are given in the thesaurus, for any c ~ Cr, we define its salience with respect to w, denoted as Sal(c, w), as (2).</Paragraph>
  </Section>
  <Section position="4" start_page="188" end_page="190" type="metho">
    <SectionTitle>
(2) Sal(c, w) = |{w_i | c in NC_i}| / n
</SectionTitle>
    <Paragraph position="0"> n So we can build a context vector for w as (3), denoted as CVw, whose dimension is \[Cr\[. Chinese thesaurus, iii) a Chinese corpus consisting of 80 million Chinese characters. In the Chinese dictionary, 37,824 words have only one sense, among which only 27,034 words occur in the corpus, we select 15,000 most frequent mona=sense words in the corpus to build the semantic space for Chinese. In the Chinese thesaurus, the words are divided into 12 major classes, 94 medium classes and 1428 minor classes respectively, and each class is given a semantic code, we select the semantic codes for the minor classes to formalize the contexts of the words. So k=\[ CT\[ =1428.</Paragraph>
    <Paragraph position="1">  3. Structure of Semantic Space * Due to the similarity/dissimilarity relation between  word senses, those in the semantic space cannot be distributed in an uniform way. We suppose that the senses form some clusters, and the senses in each cluster are similar with each other. In order to make out the clusters, we first construct a dendrogram of the senses based on their similarity, then make use of an heuristic strategy to select some appropriate nodes in the dendrogram which most likely correspond with the clusters.</Paragraph>
    <Paragraph position="2"> Now that word senses occur in accordance with their contexts, we measure their similarity /dissimilarity by their contexts. For any two senses st, seeS, let cvt=(xt xe ... xk), cve=( Yt Ye ... yk) be their context vectors respectively, we define the distance between st and se; denoted as dis(st, se), based on the cosine of the angle between the two vectors.</Paragraph>
    <Paragraph position="3">  (4) dis(st, s2)=l-cos(cvt, cv2) 2 (3) cvw=&lt;Sal(ct, w), Sal(c2, w) ..... Sal(ck, w)&gt;  where k= I CTI.</Paragraph>
    <Paragraph position="4"> When building the semantic space for Chinese language, we make use of the following resources, i) Xiandai Hanyu Cidian(1978), a Modem Chinese Dictionary, ii) Tongyici Cilin(Mei et al, 1983), a Obviously, dis(st, s2) is a normalized coefficient: its value ranges from 0 to 1.</Paragraph>
    <Paragraph position="5"> Suppose S is the set of the mona-senses in the  I V l~i~k l~i~k semantic space, for any sense si~S, we create a preliminary node dj, and let Idjl=l, which denotes the number of senses related with d~. Suppose D be the set of all preliminary nodes, the following is the algorithm to construct the dendrogram.</Paragraph>
    <Paragraph position="6">  i) select dr and d2 among all in D, whose distance is the smallest; ii) merge dl and d2 into a new node d, and let Idl=l dl I+l de l; iii) remove dr and d2 from D, and put d into D; iv) compute the context vector of d based on the vectors of dl and d/; v) go to i) until there is only one node; end; Obviously, the algorithm is a bottom-up merging procedure. In each step, two closest nodes are selected and merged into a new one. In (n-1)th step, where n is the number of word senses in S, a final node is produced. The complexity of the algorithm is O(n 3) when implementing it directly, but can be reduced to O(n 2) by sorting the distances between all nodes in each previous step.</Paragraph>
    <Paragraph position="7"> Fig. 1 is a sub-tree of the dendrogram we build for Chinese. It contains six mono-sense words, whose English correspondences are sad, sorrowful, etc. In the sub-tree, we mark each non-preliminary node with the distance between the two merged subnodes, which we also refer to as the weight of the node.</Paragraph>
    <Paragraph position="8"> It can be proved that the distances between the merged nodes in earlier merging steps are smaller than those in later merging steps 4. According to the similarity/dissimilarity relation between the senses, there should exist a level across the dendrogram such that the weights of the nodes above it are bigger, while the weights of the nodes below it smaller, in other words, the ratio between the mean weight of the nodes above the level and that of the nodes below the level is the biggest. Furthermore we suppose that the nodes immediately below the level correspond with the clusters of similar senses. So, in  order to make out the sense clusters, we only need to determine the level.</Paragraph>
    <Paragraph position="9"> Unfortunately, the complexity of determining such a level is exponential to the edges in the dendrogram, which demonstrates that the problem is hard. So we adopt an heuristic strategy to determine an optimal level.</Paragraph>
    <Paragraph position="10"> Suppose T is the dendrogram, sub_T is the sub-tree of T, which takes the same root as T, we also use T and sub T to denote the sets of non-preliminary nodes in T and in sub_T respectively, for any de T, let Wei(d) be the weight of the node d, we define an object function, as (5):  - /JZ- ub /* v.L is not a sense cluster */ else Tee-- Tc ~ {v.L}; /* v.L is a sense cluster */ if Obj(sub T+v.R)&gt;Obj(sub T) then Clustering(v.R, sub T) /* v.R is not a sense cluster */ else TcC/-- Tc u {v.R};</Paragraph>
    <Paragraph position="12"> The algorithm is a depth-first search procedure. Its complexity is O(n), where n is the number of the leaf nodes in the dendrogram, i.e., the number of the mono-sense words in the semantic space.</Paragraph>
    <Paragraph position="13"> When building the dendrograrn for the Chinese semantic space, we found 726 sense clusters in the space. The distribution of the senses in the clusters is demonstrated in Table 1.</Paragraph>
    <Paragraph position="14"> where the numerator is the mean weight of the nodes in sub_T, while the denominator is the mean weight of the nodes in T-sub_T.</Paragraph>
    <Paragraph position="15"> In order to specify the sense clusters, we only need to determine a sub-tree of T which makes (5) get its biggest value. We adopt depth-first search strategy to determine the sub-tree. Suppose vo is the root ofT, for any veT, we use v.L andv.R to denote its two sub-nodes, let Tc be the set of all the nodes corresponding with the sense clusters, we can get Tc by Clustering(vo, Nlff) calling the following procedure.</Paragraph>
  </Section>
  <Section position="5" start_page="190" end_page="192" type="metho">
    <SectionTitle>
4. Disambiguation Procedure
</SectionTitle>
    <Paragraph position="0"> Given a word in some context, we suppose that some clusters in the space can be activated by the context, which reflects the fact that the contexts of the clusters are similar with the given context. But the given context may contain much noise, so there may be some activated clusters in which the senses are not similar with the correct sense of the word in the given context. But due to the fact that the given context can suggest the correct sense of the word, there should be clusters, among all activated ones, in which the senses are similar with the correct sense.</Paragraph>
    <Paragraph position="1">  To make out these clusters, we make use of the definitions of the words in the Modem Chinese Dictionary, and determine the correct sense of the word in the context by measuring the similarity between their definitions.</Paragraph>
    <Paragraph position="2"> percent 1</Paragraph>
    <Section position="1" start_page="191" end_page="192" type="sub_section">
      <SectionTitle>
4.1 Activation
</SectionTitle>
      <Paragraph position="0"> Given a word w in some context, we consider the context as consisting of n words to the left of the word, i.e., w.,,, w.(,. 0 .... , w.l and n words to the right of the word, i.e., wl, w2, w3 ..... w,,. We make use of the semantic codes given in the Chinese thesaurus to ............ ,- .... ,,nnnno/_ lnno/^ 100%</Paragraph>
      <Paragraph position="2"> create a context vector to formally model the context,. Suppose NCw be the set of all semantic codes of the words in the context, then cvw=&lt;x~, x2 ..... xp, where if c~eNC,, then x~=l; otherwise x~=O.</Paragraph>
      <Paragraph position="3"> For any cluster clu in the space, let cvau be its context vector, we also define its distance from w based on the cosine of the angle between their context vectors as (6).</Paragraph>
      <Paragraph position="4"> (6) disl(clu, w)=l-cos(cvdu, cvw) We say clu is activated, if disl(clu, w).~dl, where dt is a threshold. Here we don't define the activated cluster as the one which makes disl(clu, w) smallest, this is because that the context may contain much noise, and the senses in the cluster which makes disj(clu, w) smallest may not be similar with the very sense of the word in the context.</Paragraph>
      <Paragraph position="5"> To estimate a reasonable value for dl, we can compute the distance between the context vector of each mono-sense word occurrence in the corpus and the context vector of the cluster containing the word, then select a reasonable value ford1 based on these distances as the threshold. Suppose CLU is the set of all sense clusters in the space, O is the set of all occurrences of the mono-sense word in the corpus, for any weO, let cluw be the sense cluster containing the sense in the space, we compute all distances dist(cluw, w), for all weO. It should be the case that most values for disl(cluw, w) will be smaller than a threshold, but some will be bigger, even close to 1, this is because most contexts in which the mono-sense words occur would contain meaningful words for the senses, while other contexts contain much noise, and less words, even no words in the contexts are meaningful for the senses.</Paragraph>
      <Paragraph position="6"> When estimating the parameter di for the Chinese semantic space, we let n=5, i.e., we only take 5 words to the left or the right of a word as its context. Fig. 2 demonstrates the distribution of the values of disl(cluw, w), where X axle denotes the  distance, and Y axle denotes the percent of the distances whose values are smaller than x~\[0, 1\] among all distances. We produce a function fix) to model the distribution based on commonly used smoothing tools and locate its inflection point by settingf&amp;quot;(x)=0. Finally we get x=0.378, and let it be the threshold dr.</Paragraph>
    </Section>
    <Section position="2" start_page="192" end_page="192" type="sub_section">
      <SectionTitle>
4.2 Definition-Based Disambiguation
</SectionTitle>
      <Paragraph position="0"> Given a word w in some context c, suppose CLU~ is the set of all the clusters in the semantic space activated by the context, the problem is to determine the correct sense of the word in the context, among all of its senses defined in the modem Chinese dictionary.</Paragraph>
      <Paragraph position="1"> The activation of the clusters in CLUw by the context c demonstrates that c is similar with the contexts of the clusters in CLUw, so there should be at least one cluster in CLU~, in which the senses are similar with the correct sesne of w in c. On the other hand, now that the senses in a cluster are similar in meaning, their definitions in the dictionary should contain similar words, which can be characterized as holding the same semantic codes in the thesaurus.</Paragraph>
      <Paragraph position="2"> So the definitions of all the words in the clusters contain strong and meaningful information about the very sense of the word in the context.</Paragraph>
      <Paragraph position="3"> We first construct two definition vectors to model the definitions of all the words in a cluster and the definitions of w based on the semantic codes of the definition words 6, then determine the sense of w in the context by measuring the similarity between each definition of w and the definitions of all the words in a cluster.</Paragraph>
      <Paragraph position="4"> For any clu~CLU~, suppose clu={wJ l_&lt;ig_n}, let C~ be the set of all semantic codes of all the words in w;s definition, CT be defined as above, i.e., the set of all the semantic codes in the thesaurus, for any CeCT, we define its salience with respect to clu, denoted as sal(c, clu), as (7).</Paragraph>
      <Paragraph position="5"> 6 The words in the definitions are called definition words. Suppose Sw is the set ofw's senses defined in the dictionary, for any sense s~Sw, let Cs be the set of all the semantic codes of its definition words, we call dvs=&lt;xl, x2 ..... xk&gt; definition vector of s, where for all i, ifci~C,, x~=l; otherwise x~=0.</Paragraph>
      <Paragraph position="6"> We define the distance between an activated cluster in the semantic space and the sense of a word as (9) again in terms of the cosine of the angle between their definition vectors.</Paragraph>
      <Paragraph position="7"> (9) dis2(clu, s)=l-cos(dvau, dv,) Intuitively the distance can be seen as a measure of the similarity between the definitions of the words in the cluster and each definition of the word. Compared with the distance defined in (6), this distance is to measure the similarity between definitions, while the distance in (6) is to measure the similarity between contexts.</Paragraph>
      <Paragraph position="8"> Thus it is reasonable to select the sense s* among all as the correct one in the context, such that there exists clu'~CLUw, and dis2(clu*, s*) gets the smallest value as (10), for clu~CLUw, and s~Sw.</Paragraph>
      <Paragraph position="9"> (1 O) MIN dis 2 ( clu, s) cIueCLU w ,s~.S,,</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="192" end_page="193" type="metho">
    <SectionTitle>
5. Experiments and Results
</SectionTitle>
    <Paragraph position="0"> In order to evaluate the application of the Chinese semantic space to WSD tasks, we make use of another Chinese lexical resource, i.e., Xiandai Hanyu Cihai (Zhang et al., 1994), a Chinese collocation dictionary. The sense distinctions in the dictionary are the same as those in the modem Chinese dictionary, and for each sense in the  collocation dictionary, some words are listed as its collocations. We see these collocations as the contexts of the word senses, and evaluate our algorithm automatically. We randomly select 40 ambiguous words contained in the dictionary, and there are altogether 1240 words listed as their collocations. Table 2 lists the distribution of the number of the sense clusters activated by these collocation words.</Paragraph>
    <Paragraph position="1"> Table 3 lists the distribution of the smallest distances between the word senses and the activated clusters, and the accuracy of the disambiguation. From Table 3, we can see that smaller distances between the senses and the activated clusters mean higher accuracy of disambiguation.</Paragraph>
    <Paragraph position="2"> Number of activated Number of collocations  of activated clusters occurrences of the word in the corpus, and implement our algorithm on them respectively. The result is 66 occurrences are tagged with the second sense (6 occurrences wrongly tagged), and the others tagged with the first sense (2 occurrences wrongly tagged). The overall accuracy is 92%. To examine the reasonableness of the result, we formalize four context vectors again based on semantic codes to represent the contexts of four groups of the occurrences: cvl: the context of the 60 occurrences correctly tagged with the second sense; cv2: the context of the 6 occurrences wrongly tagged with the second sense; cv3: the context of the 32 occurrences correctly tagged with the first sense; cv4: the context of the 2 occurrences wrongly tagged with the first sense; The distances between these vectors are listed in  disambiguation accuracy In another experiment, we examine the ambiguous Chinese word --0-~ (/danbo/7), it has two senses, one is less clothes taken by a man, the other is thin and weak. We randomly select 100</Paragraph>
  </Section>
  <Section position="7" start_page="193" end_page="194" type="metho">
    <SectionTitle>
7 The Pinyin of the word.
</SectionTitle>
    <Paragraph position="0"> the four groups From Table 4, we find that both the distance between cvl and cv4 and that between cv2 and cv3 are very high, which reflects the fact that they are not similar with each other. This demonstrates that one main reason for tagging errors is that the considered contexts of the words contain less meaningful information for determining the correct senses.</Paragraph>
    <Paragraph position="1"> In third experiment, we implement our algorithm on 100 occurrences of the ambiguous word f~l~ (/bianji/), it also has two senses, one is editor, the other is to edit. We find the tagging accuracy is very low. To explore the reason for the errors, we  compute the distances between its definitions and those of the words in the activated clusters, and find that the smallest distances fall in \[0.34, 0.87\]. This demonstrates that another main reason for the tagging errors is the sparseness of the clusters in the space.</Paragraph>
  </Section>
class="xml-element"></Paper>