<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0105">
  <Title>Selective Sampling of Effective Example Sentence Sets for Word Sense Disambiguation</Title>
  <Section position="4" start_page="57" end_page="60" type="metho">
    <SectionTitle>
2 Example-based verb sense disambiguation system
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> Our method for disambiguating verb senses uses a database containing examples of collocations for each verb sense and its associated case frame(s). Figure 2 shows a fragment of the entry associated with the Japanese verb toru. As with most words, the verb toru has multiple senses, a sample of which are &amp;quot;to take/steal,&amp;quot; &amp;quot;to attain,&amp;quot; &amp;quot;to subscribe&amp;quot; and &amp;quot;to reserve.&amp;quot; The database specifies the case frame(s) associated with each verb sense. In Japanese, a complement of a verb consists of a noun phrase (case filler) and its case marker suffix, for example ga (nominative) or o (accusative). The database lists several case filler examples for each case.</Paragraph>
    <Paragraph position="3"> The task of the system is &amp;quot;to interpret&amp;quot; the verbs occurring in the input text, i.e. to choose one sense from among a set of candidates. All verb senses we use are defined in &amp;quot;IPAL&amp;quot; \[7\], a machine readable dictionary. IPAL also contains example case fillers as shown in figure 2.</Paragraph>
    <Paragraph position="4"> Given an input, in our case a simple sentence, the system identifies the verb sense on the basis of the scored similarity between the input and the examples given for each verb sense. Let us take as an example the sentence below: hisho ga shindaisha o toru.</Paragraph>
    <Paragraph position="5"> (secretary-NOM) (sleeping car-ACC) (?) In this example, one may consider hisho (&amp;quot;secretary&amp;quot;) and shindaisha (&amp;quot;sleeping car&amp;quot;) to be semantically similar to joshu (&amp;quot;assistant&amp;quot;) and hikSki (&amp;quot;airplane&amp;quot;) respectively, and since both collocate with the &amp;quot;to reserve&amp;quot; sense of toru one could infer that toru may be interpreted as &amp;quot;to reserve.&amp;quot; The similarity between two different case fillers is estimated according to the length of the path between them in a thesaurus. Our current experiments are based around the Japanese word thesaurus Bunruigoihyo \[17\]. Figure 3 shows a fragment of Bunruigoihyo including some of the nouns in both figure 2 and the example sentence above, with each word corresponding to a leaf in the structure of the thesaurus. As with most thesauri, the length of the path between two terms in Bunruigoihyo is expected to reflect their relative similarity. In  table 1, we show our measure of similarity, based on the length of the path between two terms, as proposed by Kurohashi et al \[12\].</Paragraph>
    <Paragraph position="7"> Furthermore, since the restrictions imposed by the case fillers in choosing the verb sense are not equally selective, we consider a weighted case contribution to the disambiguation (CCD) of the verb senses. This CCD factor is taken into account when computing the score of a verb's sense. Consider again the case of toru in figure 2. Since the semantic range of nouns collocating with the verb in the nominative does not seem to have a strong delinearization in a semantic sense (in figure 2, the nominative of each verb sense displays the same general concept, i.e.</Paragraph>
    <Paragraph position="8"> animate), it would be difficult, or even risky, to properly interpret the verb sense based on the similarity in the nominative. In contrast, since the ranges are diverse in the accusative, it would be feasible to rely more strongly on the similarity here. This argument can be illustrated as in figure 4, in which the symbols &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; denote example case fillers of different case frames, and an input sentence includes two case fillers denoted by &amp;quot;x&amp;quot; and &amp;quot;y.&amp;quot;  The figure shows the distribution of example case fillers for the respective case frames, denoted in a semantic space. The semantic similarity between two given case fillers is represented by the physical distance between two symbols. In the nominative, since &amp;quot;x&amp;quot; happens to be much closer to a &amp;quot;2&amp;quot; than any &amp;quot;1,&amp;quot; &amp;quot;x&amp;quot; may be estimated to belong to the range of &amp;quot;2&amp;quot;s, although &amp;quot;x&amp;quot; actually belongs to both sets of &amp;quot;l&amp;quot;s and &amp;quot;2&amp;quot;s. In the accusative, however, &amp;quot;y&amp;quot; would be properly estimated to belong to the set of &amp;quot;l&amp;quot;s due to the mutual independence of the two accusative case filler sets, even though examples did not fully cover each of the ranges of &amp;quot;l&amp;quot;s and &amp;quot;2&amp;quot;s. Note that this difference would be critical if example data were sparse. We will explain the method used to compute CCD later in this section.</Paragraph>
    <Paragraph position="9">  To illustrate the overall algorithm, we will consider an abstract specification of both input and the datatbase (see figure 5). Let the input be {ncl-mcl, nc2-mc2, nc3-mc3, v}, where nei denotes the case filler for the case ci, and mci denotes the case marker for ci. The interpretation candidates for v are derived from the database as sl, 82 and s3. The database contains also a set PS8i,cj of case filler examples for each case cj of each sense 8i (&amp;quot;--&amp;quot; indicates that the corresponding case is not allowed).</Paragraph>
    <Paragraph position="10">  in Bunruigoihyo and their relative similarity (sire(X, Y))</Paragraph>
    <Paragraph position="12"> During the verb sense disambiguation process, the system discards first those candidates whose case frame does not fit the input. In the case of figure 5, s3 is discarded because the case frame of v (8a) does not subcategorize for the case cl.</Paragraph>
    <Paragraph position="13"> In the next step the system computes the score of the remaining candidates and chooses as the most plausible interpretation the one with the highest score. The score of an interpretation is computed by considering the weighted average of the similarity degrees of the input complements with respect to each of the example case fillers (in the corresponding case) listed in the database for the sense under evaluation. Formally, this is expressed by equation (1), where S(s) is the score of the sense s of the input verb, and SIM(nc, gs,c) is the maximum similarity degree between the input complement nc and the corresponding complements in the database example PSs,c (equation (2)).</Paragraph>
    <Paragraph position="15"> In equation (2), sim stands for the similarity degree between nc and an example case filler e as given by table 1.</Paragraph>
    <Paragraph position="16"> CCD(c) expresses the weight factor of the case c contribution to the (current) verb sense disambiguation. Intuitively preference should be given to cases displaying case fillers which are classified in semantic categories of greater independence. Let v be a verb with n senses (81, 82,..., 8n) and let PSsi,c be the set of example case fillers for the case c, associated with the sense si. Then, c's contribution to v's sense disambiguation, CCD(c), is likely to be higher if the example case filler sets {gsi,c I i = 1,..., n} share less elements. The notion of sharing is defined based on the similarity as in equation (3).</Paragraph>
    <Paragraph position="18"> With these definitions, CCD(c) is given by equation (4).</Paragraph>
    <Paragraph position="20"> Where a is the constant for parameterizing the extent to which CCD influences verb sense disambiguation. The larger a, the stronger CCD's influence on the system's output.</Paragraph>
  </Section>
  <Section position="5" start_page="60" end_page="63" type="metho">
    <SectionTitle>
3 Example sampling algorithm
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="60" end_page="61" type="sub_section">
      <SectionTitle>
3.1 Overview
</SectionTitle>
      <Paragraph position="0"> Let us look again at figure I in section 1. In this diagram, &amp;quot;outputs&amp;quot; refers to a corpus in which each sentence is assigned the proper interpretation of the verb during the execution phase. In the &amp;quot;training&amp;quot; phase, the system stores samples of manually disambiguated verb senses (simply checked or appropriately corrected by a human) in the database to be later used in a new execution phase. This is the issue we turn to in this section.</Paragraph>
      <Paragraph position="1"> Lewis et al. proposed the notion of uncertain example sampling for the training of statistics-based text classifiers \[13\]. Their method selects those examples that the system classifies (in this case, matching a text category) with minimum certainty. This method is based on the assumption that there is no need for teaching the system the correct answer when it answered with high certainty. However, we should take into account the training effect a given example has on other examples. In other words, by selecting an appropriate example as a sample, we can get more correct examples in the next cycle of iteration. In consequence, the number of examples to be taught will decrease. We consider maximization of this effect by means of a training utility function (TUF) aiming at ensuring that the example with the highest training utility figure, is the most useful example at a given point in time.</Paragraph>
      <Paragraph position="2"> Let S be a set of sentences, i.e. a given corpus, and T be a subset of S in which each sentence has already been manually disambiguated for training. In other words, sentences in T have been selected as samples, and are hence stored in the database. Let X be the set of the residue, realizing equation (5).</Paragraph>
      <Paragraph position="4"> We introduce a utility function TUF(x), which computes the training utility figure for an example x. The sampling algorithm gives preference to examples of maximum utility, by way of equation (6).</Paragraph>
      <Paragraph position="5"> arg max TUF(x) (6) xEX We will explain in the following sections how one could estimate TUF, based on the estimation of the certainty figure of an interpretation. Ideally the sampling size, i.e. the number of samples selected at each iteration would be such as to avoid retraining of similar examples. It should be noted that this can be a critical problem for statistics-based approaches \[1, 3, 18, 20, 24\], as the reconstruction of statistic classifiers is expensive. However, example-based systems \[5, 12, 21\] do not require the reconstruction of the system, but examples have to be stored in the database. It also should be noted that in each iteration, the system needs only compute the similarity between each example x belonging to X and the newly stored example, instead of every example belonging to T, because of the following reasons: * storing an example of verb sense interpretation si, will not affect the score of other verb senses,  * if the system memorizes the current score of si for each x, the system simply needs to compare it with the newly computed score between x and the newly stored example in T and choose the greater of the two to be the new plausibility of si.</Paragraph>
      <Paragraph position="6"> This reduces the time complexity of each iteration from O(N 2) to O(N), given that N is the total number of examples in S.</Paragraph>
    </Section>
    <Section position="2" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
3.2 Interpretation certainty
</SectionTitle>
      <Paragraph position="0"> Lewis et al. estimate certainty of an interpretation by the ratio between the probability of the most plausible text category, and the probability of any other text category, excluding the most probable one. Similarly, in our example-based verb sense disambiguation system, we introduce the notion of interpretation certainty of examples based on the following applicability restrictions:  1. the highest interpretation score is sufficiently large, 2. the highest interpretation score is significantly larger than the second highest score.  The rationale for these restrictions is given below. Consider figure 6, where each symbol denotes an example in S, with symbols &amp;quot;x&amp;quot; belonging to X and symbols &amp;quot;e&amp;quot; belonging to T. The curved lines delimit the semantic vicinities (extents) of the two &amp;quot;e&amp;quot;s, i.e. sense 1 and sense 2, respectively 1. The semantic similarity between two sentences is graphically portrayed by the physical distance between the two symbols representing them. In figure 6-a, &amp;quot;x's located inside a semantic vicinity are expected to be interpreted with high certainty as being similar to the appropriate example &amp;quot;e,&amp;quot; a fact which is in line with restriction 1 mentioned above. However, in figure 6-b, the degree of certainty for the interpretation of any &amp;quot;x&amp;quot; which is located inside the intersection of the two semantic vicinities cannot be great. This happens when the case fillers of two or more verb senses are not selective enough to allow a clear cut delineation among them. This situation is explicitly rejected by restriction 2.</Paragraph>
      <Paragraph position="2"> Considering the two restrictions, we compute interpretation certainties by using equation (7), where C(x) is the interpretation certainty of an example x. Sl(x) and S2(x) are the highest  and second highest scores for x, respectively. )~, which ranges from 0 to 1, is a parametric constant to control the degree to which each condition affects the computation of C(x).</Paragraph>
      <Paragraph position="4"> We estimated the validity of the notion of the interpretation certainty through a preliminary experiment, in which we used the same corpus used for another experiment as described in section 4. In this experiment, we conducted a six fold-cross validation, that is, we divided the training/test data into six equal parts, and conducted six trials in which a different part was used as test data each time, and the rest as training data. We shall call these two sets the &amp;quot;test set&amp;quot; and the &amp;quot;training set.&amp;quot; Thereafter, we evaluated the relation between the applicability and the precision of the system.</Paragraph>
      <Paragraph position="5"> In this experiment, the applicability is the ratio between the number of cases where the certainty of the system's interpretation of the outputs is above a certain threshold, and the number of inputs. The precision is the ratio between the number of correct outputs, and the number of inputs. Increasing the value of the threshold, the precision also increases (at least theoretically), while the applicability decreases. Figure 7 shows the result of the experiment with several values of ~, in which the optimal ~ value seems to be in the range 0.25 to 0.5.</Paragraph>
      <Paragraph position="6"> It can be seen that, as we assumed, both restrictions are essential for the estimation of the interpretation certainty.</Paragraph>
    </Section>
    <Section position="3" start_page="62" end_page="63" type="sub_section">
      <SectionTitle>
3.3 Training utility
</SectionTitle>
      <Paragraph position="0"> The training utility of an example &amp;quot;a&amp;quot; is greater than that of another example &amp;quot;b&amp;quot; when the total interpretation certainty of examples in X increases more after training using the example &amp;quot;a&amp;quot; than after using the example &amp;quot;b.&amp;quot; Let us consider figure 8, with the basic notation as in figure 6, and let us compare the training utility of the examples &amp;quot;a,&amp;quot; &amp;quot;b&amp;quot; and &amp;quot;c.&amp;quot; Note that in this figure, whatever example we use for training, the interpretation certainty for the  neighbours (&amp;quot;x&amp;quot;s) of the chosen example increases. However, it is obvious that we can increase the total interpretation certainty of &amp;quot;x&amp;quot;s when we use &amp;quot;a&amp;quot; for training as it has more neighbours than either &amp;quot;b&amp;quot; or &amp;quot;c.&amp;quot; In consequence, one can expect that the size of the database, which is directly proportional to the number of training examples, can be decreased. Let AC(x = s, y) be the difference in the interpretation certainty of y E X after training with x E X taken with the sense s. TUF(x=s), which is the training utility function for x taken with sense s, can be computed by equation (8).</Paragraph>
      <Paragraph position="2"> We compute TUF(x) by calculating the average of each TUF(x = s), weighted by the probability that x takes sense s. This can be realized by equation (9), where P(x = s) is the probability that x is used in training with the sense s.</Paragraph>
      <Paragraph position="4"> Given the fact that (a) P(x = s) is difficult to estimate in the current formulation, and (b) the cost of computation for each TUF(x = s) is not trivial, we temporarily approximate TUF(x) as in equation (10), where K is a set of the k-best verb sense(s) of x with respect to the interpretation score in the current state.</Paragraph>
      <Paragraph position="6"/>
    </Section>
  </Section>
  <Section position="6" start_page="63" end_page="65" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We compared the performance of our example sampling method with random sampling, in which a certain proportion of a given corpus is randomly selected for training. We compared the two sampling methods by evaluating the relation between various numbers of examples in training, and the performance of the system on another corpus. We conducted a six fold-cross validation as described in section 3.2, but in this experiment, each method selected some proportion of the training set as samples. We used the same corpus as described in table 2 as training/test data. Both sampling methods used examples from IPAL to initialize the system (as seeds) with the number of example case fillers for each case being on average of about 3.7.</Paragraph>
    <Paragraph position="1"> The training/test data used in the experiment contained about one thousand simple Japanese sentences collected from news articles. Each of the sentences in the training/test data used  in our experiment contained one or several complement(s) followed by one of the ten verbs enumerated in table 2. In table 2, the column of &amp;quot;English gloss&amp;quot; describes typical English translations of the Japanese verbs. The column of &amp;quot;# of sentences&amp;quot; denotes the number of sentences in the corpus, &amp;quot;# of senses&amp;quot; denotes the number of verb senses based on IPAL, and &amp;quot;lower bound&amp;quot; denotes the precision gained by using a naive method, where the system systematically chooses the most frequently appearing interpretation in the training data \[6\].</Paragraph>
    <Paragraph position="2">  We at first estimated the system's performance by its precision, that is the ratio of the number of correct outputs, compared to the number of inputs. In this experiment, we set = 0.5 in equation (7), and k = 1 in equation (10). The influence of CCD, i.e. o~ in equation (4), was extremely large so that the system virtually relied solely on the SIM of the case with the greatest CCD.</Paragraph>
    <Paragraph position="3"> Figure 9 shows the relation between the size of the training data and the precision of the system. In figure 9, when the x-axis is zero, the system has used only the seeds given by IPAL. It should be noted that with the final step, where all examples in the training set have been provided to the database, the precision of both methods is equal. Looking at figure 9 one can see that the precision of random sampling was surpassed by our training utility sampling method. It solves the first two problems mentioned in section 1. One can also see that the size of the database can be reduced without degrading the system's precision, and as such it can solve the third problem mentioned in section 1.</Paragraph>
    <Paragraph position="4"> We further evaluated the system's performance in the following way. Integrated with other NLP systems, the task of our verb sense disambiguation system is not only to output the most plausible verb sense, but also the interpretation certainty of its output, so that other systems can vary the degree of reliance on our system's output. The following are properties which are required for our system: * the system should output as many correct answers as possible, * the system should output correct answers with great interpretation certainty, * the system should output incorrect answers with diminished interpretation certainty.</Paragraph>
    <Paragraph position="5"> Motivated by these properties, we formulated a new performance estimation measure, PM, as shown in equation (11). A greater accuracy of performance of the system will lead to a greater  In equation (11), Cmax is the maximum value of the interpretation certainty, which can be derived by substituting the maximum and the mimimum interpretation score for Si(x) and S2(x), respectively, in equation (7). Following table 1, we assign 11 and 0 to be the maximum and the minimum of the interpretation score, and therefore Cma~ = 11, disregarding the value of ~ in equation (7). N is the total number of the inputs and 5 is a coefficient defined as in equation (12).</Paragraph>
    <Paragraph position="6"> 1 if the interpratation of x is correct = (12) -p otherwise In equation (12), p is the parametric constant to control the degree of the penalty for a system error. For our experiment, we set p = 1, meaning that PM was in the range -1 to 1.</Paragraph>
    <Paragraph position="7"> Figure 10 shows the relation between the size of the training data and the value of PM. In this experiment, it can be seen that the performance of random sampling was again surpassed by our training utility sampling method, and the size of the database can be reduced without degrading the system's performance.</Paragraph>
  </Section>
  <Section position="7" start_page="65" end_page="65" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> In this section, we will discuss several remaining problems. First, since in equation (8), the system calculates the similarity between x and each example in X, computation of TUF(x = s) becomes time consuming. To avoid this problem, a method used in efficient database search techniques \[9, 22\], in which the system can search some neighbour examples of x with optimal time complexity, can be potentially used.</Paragraph>
    <Paragraph position="1">  Second, there is a problem as to when to stop the training: that is, as mentioned in section 1, it is not reasonable to manually analyze large corpora as they can provide virtually infinite input. One plausibile solution would be to select a point when the increment of the total interpretation certainty of remaining examples in X is not expected to exceed a certain threshold.</Paragraph>
    <Paragraph position="2"> Finally, we should also take the semantic ambiguity of case fillers (noun) into account. Let us consider figure 11, where the basic notation is the same as in figure 6, and one possible problem caused by case filler ambiguity is illustrated. Let &amp;quot;xl&amp;quot; and &amp;quot;x2&amp;quot; denote different senses of a case filler &amp;quot;x.&amp;quot; Following the basis of equation (7), the interpretation certainty of &amp;quot;x&amp;quot; is small in both figure ll-a and ll-b. However, in the situation as in figure ll-b, since (a) the task of distinction between the verb senses 1 and 2 is easier, and (b) instances where the sense ambiguity of case fillers corresponds to distinct verb senses will be rare, training using either &amp;quot;xl&amp;quot; or &amp;quot;x2&amp;quot; will be less effective than as in figure ll-a. It should also be noted that since Bunruigoihyo is a relatively small-sized thesaurus and does not enumerate many word senses, this problem is not critical in our case. However, given other existing thesauri like the EDR electronic dictionary \[4\] or WordNet \[15\], these two situations should be strictly differentiated.</Paragraph>
  </Section>
class="xml-element"></Paper>