<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2911">
  <Title>Applying Alternating Structure Optimization to Word Sense Disambiguation</Title>
  <Section position="5" start_page="77" end_page="78" type="metho">
    <SectionTitle>
2 Alternating structure optimization
</SectionTitle>
    <Paragraph position="0"> CU is regularized empirical risk minimization, which minimizes an empirical loss of the predictor (with regularization) on the D2 labeled training examples  CV, and assume that there exists a low-dimensional predictive structure shared by these D1 problems. Ando and Zhang (2005a) extend the above traditional linear model to a joint linear model so that a predictor for problem CO is in the form:</Paragraph>
    <Paragraph position="2"> are weight vectors specific to each problem CO. Predictive structure is parameterized by the structure matrix A2 shared by all the D1 predictors. The goal of this model can also be regarded as learning a common good feature map A2DC used for all the D1 problems. null</Paragraph>
    <Section position="1" start_page="77" end_page="78" type="sub_section">
      <SectionTitle>
2.3 ASO algorithm
</SectionTitle>
      <Paragraph position="0"> Analogous to (1), we compute A2 and predictors so that they minimize the empirical risk summed over all the problems:</Paragraph>
      <Paragraph position="2"> This minimization can be approximately solved by repeating the following alternating optimization procedure until a convergence criterion is met:  Nouns art, authority, bar, bum, chair, channel, child, church, circuit, day, detention, dyke, facility, fatigue, feeling, grip, hearth, holiday, lady, material, mouth, nation, nature, post, restraint, sense, spade, stress, yew Verbs begin, call, carry, collaborate, develop, draw, dress, drift, drive, face, ferret, find, keep, leave, live, match, play, pull, replace, see, serve strike, train, treat, turn, use, wander wash, work Adjectives blind, colourless, cool, faithful, fine, fit, free, graceful, green, local, natural, oblique, simple, solemn, vital  minimizes the joint empirical risk (4).</Paragraph>
      <Paragraph position="3"> The first step is equivalent to training D1 predictors independently. The second step, which couples all the predictors, can be done by setting the rows of A2 to the most significant left singular vectors of the predictor (weight) matrix CD BP CJD9  . That is, the structure matrix A2 is computed so that the projection of the predictor matrix CD onto the subspace spanned by A2's rows gives the best approximation (in the least squares sense) of CD for the given row-dimension of A2. Thus, in- null tuitively, A2 captures the commonality of the D1 predictors. null ASO has been shown to be useful in its semi-supervised learning configuration, where the above algorithm is applied to a number of auxiliary problems that are automatically created from the unlabeled data. By contrast, the focus of this paper is the multi-task learning configuration, where the ASO algorithm is applied to a number of real problems with the goal of improving overall performance on these problems.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="78" end_page="78" type="metho">
    <SectionTitle>
3 Effective use of ASO on word sense disambiguation
</SectionTitle>
    <Paragraph position="0"> disambiguation The essence of ASO is to learn information useful for prediction (predictive structure) shared by multiple tasks, assuming the existence of such shared structure. From this viewpoint, consider the target words of the Senseval-2 lexical sample task, shown in Figure 1. Here we have multiple disambiguation tasks; however, at a first glance, it is not entirely clear whether these tasks share predictive structure (or are related to each other). There is no direct semantic relationship (such as synonym or hyponym relations) among these words.</Paragraph>
    <Paragraph position="1"> word uni-grams in 5-word window, Local word bi- and tri-grams of B4DB</Paragraph>
  </Section>
  <Section position="7" start_page="78" end_page="79" type="metho">
    <SectionTitle>
w_i
</SectionTitle>
    <Paragraph position="0"> stands for the word at position CX relative to the word to be disambiguated. The 5-word window is CJA0BEBNB7BECL. Local context and POS features are positionsensitive. Global context features are position insensitive (a bag of words).</Paragraph>
    <Paragraph position="1"> The goal of this section is to empirically study the effective use of ASO for improving overall performance on these seemingly unrelated disambiguation problems. Below we first describe the task setting, features, and algorithms used in our implementation, and then experiment with the Senseval-2 English lexical sample data set (with the official training / test split) for the development of our methods. We will then evaluate the methods developed on the Senseval-2 data set by carrying out the Senseval-3 tasks, i.e., training on the Senseval-3 training data and then evaluating the results on the (unseen) Senseval-3 test sets in Section 4.</Paragraph>
    <Paragraph position="2"> Task setting In this work, we focus on the Senseval lexical sample task. We are given a set of target words, each of which is associated with several possible senses, and their labeled instances for training. Each instance contains an occurrence of one of the target words and its surrounding words, typically a few sentences. The task is to assign a sense to each test instance.</Paragraph>
    <Paragraph position="3"> Features We adopt the feature design used by Lee and Ng (2002), which consists of the following four types: (1) Local context: D2-grams of nearby words (position sensitive); (2) Global context: all the words (excluding stopwords) in the given context (position-insensitive; a bag of words); (3) POS: parts-of-speech D2-grams of nearby words; (4) Syn- null tactic relations: syntactic information obtained from parser output. To generate syntactic relation features, we use the Slot Grammar-based full parser ESG (McCord, 1990). We use as features syntactic relation types (e.g., subject-of, object-of, and noun modifier), participants of syntactic relations, and bi-grams of syntactic relations / participants. Details of the other three types are shown in Figure 2.</Paragraph>
    <Paragraph position="4"> Implementation Our implementation follows Ando and Zhang (2005a). We use a modification of the Huber's robust loss for regression:</Paragraph>
    <Paragraph position="6"> otherwise; with square regularization (AL BP BDBC A0BG), and perform empirical risk minimization by stochastic gradient descent (SGD) (see e.g., Zhang (2004)). We perform one ASO iteration.</Paragraph>
    <Section position="1" start_page="79" end_page="79" type="sub_section">
      <SectionTitle>
3.1 Exploring the multi-task learning configuration
</SectionTitle>
      <Paragraph position="0"> configuration The goal is to effectively apply ASO to the set of word disambiguation problems so that overall performance is improved. We consider two factors: feature split and partitioning of prediction problems. 3.1.1 Feature split and problem partitioning Our features described above inherently consist of four feature groups: local context (C4BV), global context (BZBV), syntactic relation (CBCA), and POS features. To exploit such a natural feature split, we explore the following extension of the joint linear model:  ) is a portion of the feature vector DC (or the weight vector DA CO ) corresponding to the feature group CY, respectively. This is a slight modification of the extension presented in (Ando and Zhang, 2005a). Using this model, ASO computes the structure matrix A2</Paragraph>
      <Paragraph position="2"> ture group separately. That is, SVD is applied to the sub-matrix of the predictor (weight) matrix corresponding to each feature group CY, which results in more focused dimension reduction of the predictor matrix. For example, suppose that BY BP CUCBCACV. Then, we compute the structure matrix A2</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="79" end_page="79" type="metho">
    <SectionTitle>
Θ_SR
from
</SectionTitle>
    <Paragraph position="0"> the corresponding sub-matrix of the predictor matrix CD, which is the gray region of Figure 3 (a). The structure matrices A2</Paragraph>
  </Section>
  <Section position="9" start_page="79" end_page="80" type="metho">
    <SectionTitle>
Θ_f
</SectionTitle>
    <Paragraph position="0"> for CY BPBE BY (associated with the white regions in the figure) should be regarded as being fixed to the zero matrices. Similarly, it is possible to compute a structure matrix from a subset of the predictors (such as noun disambiguators only), as in Figure 3 (b). In this example, we apply the extension of ASO with BY BP CUCBCACV to three sets of problems (disambiguation of nouns, verbs, and adjectives, respectively) separately.</Paragraph>
    <Paragraph position="1">  To see why such partitioning may be useful for our WSD problems, consider the disambiguation of &amp;quot;bank&amp;quot; and the disambiguation of &amp;quot;save&amp;quot;. Since a &amp;quot;bank&amp;quot; as in &amp;quot;money bank&amp;quot; and a &amp;quot;save&amp;quot; as in &amp;quot;saving money&amp;quot; may occur in similar global contexts, certain global context features effective for recognizing the &amp;quot;money bank&amp;quot; sense may be also effective for disambiguating &amp;quot;save&amp;quot;, and vice versa. However, with respect to the position-sensitive local context features, these two disambiguation problems may not have much in common since, for instance, we sometimes say &amp;quot;the bank announced&amp;quot;, but we rarely say &amp;quot;the save announced&amp;quot;. That is, whether problems share predictive structure may depend on feature types, and in that case, seeking predictive structure for each feature group separately may be more effective. Hence, we experiment with the configurations with and without various feature splits using the extension of ASO.</Paragraph>
    <Paragraph position="2"> Our target words are nouns, verbs, and adjectives. As in the above example of &amp;quot;bank&amp;quot; (noun) and &amp;quot;save&amp;quot; (verb), the predictive structure of global context features may be shared by the problems irrespective of the parts of speech of the target words.</Paragraph>
    <Paragraph position="3"> However, the other types of features may be more dependent on the target word part of speech. There- null fore, we explore two types of configuration. One applies ASO to all the disambiguation problems at once. The other applies ASO separately to each of the three sets of disambiguation problems (noun disambiguation problems, verb disambiguation problems, and adjective disambiguation problems) and uses the structure matrix A2</Paragraph>
  </Section>
  <Section position="10" start_page="80" end_page="80" type="metho">
    <SectionTitle>
Θ_f
</SectionTitle>
    <Paragraph position="0"> obtained from the noun disambiguation problems only for disambiguating nouns, and so forth.</Paragraph>
    <Paragraph position="1"> Thus, we explore combinations of two parameters. One is the set of feature groups BY in the model (5). The other is the partitioning of disambiguation problems.</Paragraph>
    <Paragraph position="2">  task configurations varying feature group set BY and problem partitioning. Performance at the best dimensionality of A2</Paragraph>
    <Paragraph position="4"> CUBDBCBNBEBHBNBHBCBNBDBCBCBNA1A1A1CV) is shown.</Paragraph>
    <Paragraph position="5"> In Figure 4, we compare performance on the Senseval-2 test set produced by training on the Senseval-2 training set using the various configurations discussed above. As the evaluation metric, we use the F-measure (micro-averaged)3 returned by the official Senseval scorer. Our baseline is the standard single-task configuration using the same loss function (modified Huber) and the same training algorithm (SGD).</Paragraph>
    <Paragraph position="6"> The results are in line with our expectation. To learn the shared predictive structure of local context (LC) and syntactic relations (SR), it is more advantageous to apply ASO to each of the three sets of problems (disambiguation of nouns, verbs, and adjectives, respectively), separately. By contrast, global context features (GC) can be more effectively exploited when ASO is applied to all the disambigua3Our precision and recall are always the same since our systems assign exactly one sense to each instance. That is, our F-measure is the same as 'micro-averaged recall' or 'accuracy' used in some of previous studies we will compare with.</Paragraph>
    <Paragraph position="7"> tion problems at once. It turned out that the configuration BY BP CUC8C7CBCV does not improve the performance over the baseline. Therefore, we exclude POS from the feature group set BY in the rest of our experiments. Comparison of BY BP CUC4BVB7CBCAB7BZBVCV (treating the features of these three types as one group) and BY BP CUC4BVBNCBCABNBZBVCV indicates that use of this feature split indeed improves performance.</Paragraph>
    <Paragraph position="8"> Among the configurations shown in Figure 4, the best performance (67.8%) is obtained by applying ASO to the three sets of problems (corresponding to nouns, verbs, and adjectives) separately, with the feature split BY BP CUC4BVBNCBCABNBZBVCV.</Paragraph>
    <Paragraph position="9"> ASO has one parameter, the dimensionality of the structure matrix A2</Paragraph>
  </Section>
  <Section position="11" start_page="80" end_page="80" type="metho">
    <SectionTitle>
Θ_f
</SectionTitle>
    <Paragraph position="0"> (i.e., the number of left singular vectors to compute). The performance shown in Figure 4 is the ceiling performance obtained at the best dimensionality (in CUBDBCBNBEBHBNBHBCBNBDBCBCBNBDBHBCBNA1A1A1CV). In Figure 5, we show the performance dependency on</Paragraph>
  </Section>
  <Section position="12" start_page="80" end_page="81" type="metho">
    <SectionTitle>
Θ_f
</SectionTitle>
    <Paragraph position="0"> 's dimensionality when ASO is applied to all the problems at once (Figure 5 left), and when ASO is applied to the set of the noun disambiguation problems (Figure 5 right). In the left figure, the configuration BY BP CUBZBVCV (global context) produces better performance at a relatively low dimensionality.</Paragraph>
    <Paragraph position="1"> In the other configurations shown in these two figures, performance is relatively stable as long as the dimensionality is not too low.</Paragraph>
    <Section position="1" start_page="80" end_page="81" type="sub_section">
      <SectionTitle>
3.2 Multi-task learning procedure for WSD
</SectionTitle>
      <Paragraph position="0"> Based on the above results on the Senseval-2 test set, we develop the following procedure using the feature split and problem partitioning shown in Figure 6. Let C6BNCE, and BT be sets of disambiguation problems whose target words are nouns, verbs, and adjectives, respectively. We write A2 B4CYBND7B5 for the struc- null predictors for nouns predictors for verbs predictors for adjectives  ture matrix associated with the feature group CY and computed from a problem set D7. That is, we replace</Paragraph>
      <Paragraph position="2"> AF Apply ASO to the three sets of disambiguation problems (corresponding to nouns, verbs, and adjectives), separately, using the extended model (5) with BY BP CUC4BVBNCBCACV. As a result,  We fix the dimension of the structure matrix corresponding to global context features to 50. The dimensions of the other structure matrices are set to 0.9 times the maximum possible rank to ensure relatively high dimensionality. This procedure produces BIBKBMBDB1 on the Senseval-2 English lexical sample test set.</Paragraph>
    </Section>
    <Section position="2" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
3.3 Previous systems on Senseval-2 data set
</SectionTitle>
      <Paragraph position="0"> previous best systems on the Senseval-2 English lexical sample test set. Since we used this test set for the development of our method above, our performance should be understood as the potential performance.</Paragraph>
      <Paragraph position="1">  worth noting that our potential performance (68.1%) exceeds those of the previous best systems.</Paragraph>
      <Paragraph position="2"> Our single-task baseline performance is almost the same as LN02 (Lee and Ng, 2002), which uses SVM. This is consistent with the fact that we adopted LN02's feature design. FY02 (Florian and Yarowsky, 2002) combines classifiers by linear average stacking. The best system of the Senseval-2 competition was an early version of FY02. WSC04 used a polynomial kernel via the kernel Principal Component Analysis (KPCA) method (Sch&amp;quot;olkopf et al., 1998) with nearest neighbor classifiers.</Paragraph>
    </Section>
  </Section>
  <Section position="13" start_page="81" end_page="81" type="metho">
    <SectionTitle>
4 Evaluation on Senseval-3 tasks
</SectionTitle>
    <Paragraph position="0"> In this section, we evaluate the methods developed on the Senseval-2 data set above on the standard Senseval-3 lexical sample tasks.</Paragraph>
    <Section position="1" start_page="81" end_page="81" type="sub_section">
      <SectionTitle>
4.1 Our methods in multi-task and semi-supervised configurations
</SectionTitle>
      <Paragraph position="0"> semi-supervised configurations In addition to the multi-task configuration described in Section 3.2, we test the following semi-supervised application of ASO. We first create auxiliary problems following Ando and Zhang (2005a)'s partiallysupervised strategy (Figure 8) with distinct feature maps A9</Paragraph>
    </Section>
  </Section>
  <Section position="14" start_page="81" end_page="81" type="metho">
    <SectionTitle>
Φ_1
</SectionTitle>
    <Paragraph position="0"> and A9</Paragraph>
  </Section>
</Paper>