<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0406">
  <Title>Unsupervised learning of word sense disambiguation rules by estimating an optimum iteration number in the EM algorithm</Title>
  <Section position="4" start_page="3" end_page="4" type="metho">
    <SectionTitle>
3 Unsupervised learning using EM algorithm
</SectionTitle>
    <Paragraph position="0"> We can use the EM method if we use Naive Bayes for classification problems. In this paper, we show only the key equations and the key algorithm of this method (Nigam et al., 2000).</Paragraph>
    <Paragraph position="1"> Basically the method computes $P(f_i \mid c_j)$, where $f_i$ is a feature and $c_j$ is a class. This probability is given by</Paragraph>
    <Paragraph position="5"> In this paper we use the bunrui-goi-hyou as a Japanese thesaurus. This equation is smoothed by taking into account the frequency 0.</Paragraph>
    <Paragraph position="6"> $D$: all data, consisting of labeled data and unlabeled data; $d_k$: an element in $D$; $F$: the set of all features</Paragraph>
    <Paragraph position="8"> For unlabeled data, $P(c_j \mid d_k)$ is initially 0, and is updated to an appropriate value step by step as the EM algorithm iterates.</Paragraph>
    <Paragraph position="9"> Using Equation 2, the following classifier is constructed:</Paragraph>
    <Paragraph position="11"> ) converge. In our experiment, when the difference between the current $P(f_i \mid c_j)$ and the updated $P(f_i \mid c_j)$ becomes less than $8 \times 10^{-6}$, or the iteration number reaches 10, we judge that the algorithm has converged.</Paragraph>
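    <Paragraph> The stopping rule described above can be sketched as follows. This is only an illustration, not the authors' code: the names `run_until_converged`, `em_step` and `max_abs_change` are hypothetical, and measuring the "difference" as the largest absolute change over all parameters is an assumption. The text itself states only the threshold of 8 × 10^-6 and the cap of 10 iterations.

```python
from typing import Callable, Dict, Hashable, Tuple

# Hypothetical parameter table: maps a (feature, class) key to P(f | c).
Params = Dict[Hashable, float]

THRESHOLD = 8e-6      # convergence threshold stated in the text
MAX_ITERATIONS = 10   # iteration cap stated in the text


def max_abs_change(old: Params, new: Params) -> float:
    """Largest absolute change over all parameters (assumed difference measure)."""
    return max(abs(new[key] - old[key]) for key in new)


def run_until_converged(initial: Params,
                        em_step: Callable[[Params], Params]) -> Tuple[Params, int]:
    """Apply em_step until the change drops below THRESHOLD or the cap is hit.

    Returns the final parameters and the number of iterations performed.
    """
    params = initial
    for iteration in range(1, MAX_ITERATIONS + 1):
        updated = em_step(params)
        change = max_abs_change(params, updated)
        params = updated
        if THRESHOLD > change:
            return params, iteration
    return params, MAX_ITERATIONS
```

    Here `em_step` stands for one combined E-step and M-step; any function that maps the current parameter table to an updated one with the same keys can be plugged in.
    </Paragraph>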
    <Paragraph position="14"> 4 Estimation of the optimum iteration number In this paper, we propose two methods (CV-EM and CV-EM2) to estimate the optimum iteration number in the EM algorithm.</Paragraph>
    <Paragraph position="15"> The CV-EM method is based on cross validation. First, we divide the labeled data into three parts, one of which is used as test data while the other two are used as new labeled data. Using this new labeled data and the large set of unlabeled data, we run the EM method. After each iteration of the EM algorithm, the rules learned so far are evaluated on the test data. This experiment is conducted three times, changing which part serves as test data. The precision at each iteration number is given by the mean over the three experiments. The optimum iteration number is estimated to be the iteration number at which the highest mean precision is achieved.</Paragraph>
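    <Paragraph> The CV-EM selection step can be sketched as below. The sketch assumes that each fold has already produced a list of precisions indexed by iteration number, with index 0 being the classifier before any EM iteration; treating iteration 0 as a candidate and breaking ties in favour of the smallest iteration number are assumptions, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def mean_precision_per_iteration(fold_curves: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-iteration precision over the cross-validation folds."""
    lengths = {len(curve) for curve in fold_curves}
    if len(lengths) != 1:
        raise ValueError("all folds must report the same number of iterations")
    return [mean(values) for values in zip(*fold_curves)]


def cv_em_estimate(fold_curves: Sequence[Sequence[float]]) -> int:
    """Return the iteration number with the highest mean precision.

    Index 0 is the classifier before any EM iteration (an assumption of this
    sketch); ties go to the smallest iteration number (also an assumption).
    """
    averaged = mean_precision_per_iteration(fold_curves)
    return max(range(len(averaged)), key=lambda i: averaged[i])
```

    With three folds, `fold_curves` would hold three lists of equal length, one per choice of held-out test part.
    </Paragraph>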
    <Paragraph position="16"> The CV-EM2 method also uses cross validation, but estimates the optimum iteration number by an ad hoc mechanism. First, we judge whether we can use the EM method without modification. To do this, we compare the precision at convergence with the precision at iteration number 1. If the former is higher than the latter, we judge that we can use the EM method without modification. In this case, the optimum iteration number is estimated to be the iteration number at convergence. On the other hand, if the former is not higher than the latter, we proceed to the second judgment, namely whether the EM method should be used at all. To judge this, we compare the precisions at iteration numbers 0 and 1. Iteration number 0 means that the EM method is not used. If the precision at iteration number 0 is higher than the precision at iteration number 1, we judge that the EM method should not be used. In this case, the optimum iteration number is estimated to be 0. Conversely, if the precision at iteration number 1 is higher than the precision at iteration number 0, we judge that the EM method should be used. In this case, the optimum iteration number is estimated to be the number obtained by CV-EM.</Paragraph>
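    <Paragraph> The two judgments of CV-EM2 can be written as a small decision function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the three precisions are the cross-validation precisions at iteration 0, at iteration 1 and at convergence, and a tie between iterations 0 and 1 falls through to the CV-EM estimate because the text does not say how ties are handled.

```python
def cv_em2_estimate(p_initial: float,
                    p_first: float,
                    p_converged: float,
                    converged_iteration: int,
                    cv_em_iteration: int) -> int:
    """Decision rule of CV-EM2 as described in the text.

    p_initial, p_first, p_converged: cross-validation precision at iteration 0,
    at iteration 1, and at convergence. converged_iteration is the iteration
    number at which EM converged; cv_em_iteration is the CV-EM estimate.
    """
    # First judgment: EM can be used without modification.
    if p_converged > p_first:
        return converged_iteration
    # Second judgment: EM should not be used at all.
    if p_initial > p_first:
        return 0
    # Otherwise EM is used, stopped at the CV-EM estimate.
    # Assumption: a tie between iteration 0 and 1 also lands here, since the
    # text does not specify that case.
    return cv_em_iteration
```

    For example, when the first iteration lowers the precision and the precision at convergence is lower still, the function returns 0 regardless of where the cross-validation maximum lies.
    </Paragraph>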
    <Paragraph position="17"> In many cases, CV-EM2 outputs the same number as CV-EM. However, the basic idea is different. Roughly speaking, CV-EM2 relies on two heuristics: (1) We basically only have to judge whether the EM method can be used or not, because the EM algorithm improves or degrades the precision monotonically. (2) Whether the EM algorithm succeeds correlates closely with whether the precision is improved by the first iteration of the EM algorithm. Therefore, we estimate the optimum iteration number by comparing three precisions: the precision at iteration number 0, at iteration number 1, and at convergence. Figure 2 shows a typical case in which CV-EM2 differs from CV-EM. In the cross validation, the precision is degraded by the first iteration of the EM algorithm, is then improved by further iterations, and reaches its maximum at the k-th iteration, but it converges to a lower point than the precision at iteration number 1. In this case, CV-EM gives k as the estimate, but CV-EM2 gives 0.</Paragraph>
  </Section>
  <Section position="5" start_page="4" end_page="5" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> To confirm the effectiveness of our methods, we tested them on the 50 nouns of the Japanese Dictionary Task in SENSEVAL2 (Kurohashi and Shirai, 2001).</Paragraph>
    <Paragraph position="1"> The Japanese Dictionary Task is a standard WSD problem. As evaluation words, 50 nouns and 50 verbs are provided. These words are selected so as to balance the difficulty of WSD. The number of labeled instances is 177.4 on average for nouns and 172.7 on average for verbs. The number of test instances for each evaluation word is 100, so there are 5,000 test instances each for the noun and the verb evaluation words.</Paragraph>
    <Paragraph position="2"> However, unlabeled data are not provided. Note that we cannot use simple raw texts including the target word, because we must use the same dictionary and part-of-speech set as the labeled data. Therefore, we use Mainichi newspaper articles from 1995 with word segmentations provided by RWC. This data is the origin of the labeled data. As a result, we gathered on average 7585.5 unlabeled instances per noun evaluation word and 6571.9 per verb evaluation word.</Paragraph>
    <Paragraph position="3"> Table 1 shows the results of experiments for noun evaluation words. In this table, NB means Naive Bayes, EM the EM method, and ideal the EM method stopping at the ideal iteration number. Note that the precision is computed by mixed-grained scoring (Kurohashi and Shirai, 2001), which gives partial points in some cases.</Paragraph>
    <Paragraph position="4"> The precision of Naive Bayes, which learns from labeled data only, was 76.58%. The EM method failed to boost it, and degraded it to 73.56%. On the other hand, by using CV-EM the precision was boosted to 77.88%. Furthermore, CV-EM2 boosted it to 78.56%. This score is comparable to the best published score for this task. Two successful results have been reported for this task. One used Naive Bayes with various attributes, and achieved 78.22% precision (Murata et al., 2001). The other used AdaBoost with decision trees, and achieved 78.47% precision (Nakano and Hirai, 2002). Our score is higher than these scores. Furthermore, their methods used syntactic analysis, but our methods do not need it.</Paragraph>
    <Paragraph position="5"> In the same way, we performed experiments for verb evaluation words. Table 2 shows the results. In the experiment, Naive Bayes achieved 78.16% precision. The EM method boosted it to 78.74%. Furthermore, CV-EM and CV-EM2 boosted it to 79.22% and 79.26% respectively. CV-EM2 is marginally higher than CV-EM.</Paragraph>
    <Paragraph position="6">  The best score for the total of noun words and verb words is reported to be 79.33% in (Murata et al., 2001).</Paragraph>
  </Section>
  <Section position="6" start_page="5" end_page="5" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.1 Cause of failure of the EM method
</SectionTitle>
      <Paragraph position="0"> Why does the EM method often fail to boost the performance? One reason may be the difference among the class distributions of labeled data L, unlabeled data U and test data T. Ideally the distributions of L, U and T are the same, because all three consist of random samples from the same data. In practice, however, there are differences among them.</Paragraph>
      <Paragraph position="1"> Intuitively, learning by combining labeled data and unlabeled data can be regarded as learning from the distribution of L + U. The EM method is expected to be effective if d = d(L, T) - d(L+U, T) &gt; 0, and counterproductive if d &lt; 0, where d(*, *) denotes the distance between two distributions.</Paragraph>
      <Paragraph position="2"> To confirm the above expectation, we conduct an experiment by using Kullback-Leibler divergence as d(*, *). The distribution of L + U can be obtained from Equation 4 when the EM algorithm converges. The result of the experiment is shown in Table 3.</Paragraph>
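      <Paragraph> The quantity d can be computed as in the following sketch. Several points are assumptions not fixed by the text: each distribution is given as a list of class probabilities in the same class order, d(P, Q) is taken to be KL(P || Q) with the natural logarithm, and the function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) with natural logarithm."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue          # 0 * log(0 / q) is taken as 0
        if q_i == 0.0:
            return math.inf   # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


def distribution_shift(labeled: Sequence[float],
                       combined: Sequence[float],
                       test: Sequence[float]) -> float:
    """d = d(L, T) - d(L+U, T); positive means L+U is closer to T than L is."""
    return kl_divergence(labeled, test) - kl_divergence(combined, test)
```

      A positive return value corresponds to the "positive" column discussed for Table 3, and a negative value to the "negative" column.
      </Paragraph>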
      <Paragraph position="3">  The columns of the table are divided into positive (d&gt;0) and negative (d&lt;0). Positive means that L + U gets close to T and negative means that L+U goes away from T. The rows of the table are divided into improvement of precision and deterioration of precision. In this paper, improvement of precision is when the precision is improved by over 5%, and deterioration of precision is when the precision is degraded by over 5%.</Paragraph>
      <Paragraph position="4"> This result indicates that there is a weak correlation between whether L + U gets close to T or goes away from T and whether the EM method is effective or not, but we cannot conclude they are completely dependent. However, the evaluation word 'genzai' whose precision falls most by the EM method is precisely the above case. The d for this word is the smallest, -0.30, among all evaluation words. Further investigation of the causes of failure of the EM method is our future work.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.2 Effectiveness of estimation of CV-EM2
</SectionTitle>
      <Paragraph position="0"> CV-EM2 achieved ideal estimation for 29 of 50 evaluation words, that is 58%. Furthermore, for 15 of the other 21 evaluation words, the difference between the precision through our method and that through ideal estimation did not exceed 2%. Therefore, estimation of CV-EM2 is mostly effective.</Paragraph>
      <Paragraph position="1"> The words 'kokunai' and 'kotoba' are typical cases where estimation fails. The difference between the precision of CV-EM2 and that through ideal estimation exceeded 5%. The failure of estimation for these two words reduced the whole precision.</Paragraph>
      <Paragraph position="2"> Figure 3 compares the precision for cross validation and that for actual evaluation for the word 'kokunai'. In the same way, Figure 4 shows the case of the word 'kotoba'. In these figures, the x-axis shows the iteration number of the EM algorithm. To clarify the change of precision, the initial precision is set to 0, and the y-axis shows the difference (%) between the actual and initial precision.</Paragraph>
      <Paragraph position="3"> In the case of 'kokunai', the precision got worse in cross validation, but got better in the actual evaluation. This means that cross validation is useless in such a case, so it is difficult to estimate an optimum iteration number in the EM algorithm. However, such cases are rare. In the experiment, this case arises only for the word 'kokunai'. Consider next the case of 'kotoba'. In cross validation, the precision improved in the first iteration of the EM algorithm, but got worse step by step thereafter.</Paragraph>
      <Paragraph position="4"> On the other hand, in the actual evaluation, the precision got worse even in the first iteration of the EM algorithm. This difference between the results in the first iteration of the EM algorithm causes our estimation to fail. In the future, we must improve our method by further investigating these words.</Paragraph>
    </Section>
    <Section position="3" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.3 Comparison of CV-EM and CV-EM2
</SectionTitle>
      <Paragraph position="0"> CV-EM2 is slightly superior to CV-EM. For the evaluation word 'doujitsu', there is a remarkable difference between the two methods.</Paragraph>
      <Paragraph position="1"> Figure 5 shows the change of the precision for 'doujitsu' in cross validation, and Figure 6 shows that in actual evaluation.</Paragraph>
      <Paragraph position="2"> The precision goes up in cross validation, but drops sharply in actual evaluation. CV-EM selects the best point in cross validation, which is iteration 3. On the other hand, CV-EM2 estimates 0 by using the relation among three precisions: the initial precision, the precision at iteration 1 and the precision at convergence.</Paragraph>
      <Paragraph position="3"> We now count the number of words for which CV-EM2 is better or worse than CV-EM. For one noun, 'mae', and three verbs, 'kuru', 'koeru' and 'tukuru', CV-EM was superior to CV-EM2. On the other hand, for four nouns, 'atama', 'kaku n', 'te' and 'doujitsu', and four verbs, 'ukeru', 'umareru', 'toru' and 'mamoru', CV-EM2 was better than CV-EM.</Paragraph>
      <Paragraph position="4"> These numbers show that CV-EM2 is somewhat superior to CV-EM.</Paragraph>
    </Section>
    <Section position="4" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.4 Unsupervised learning for verb WSD
</SectionTitle>
      <Paragraph position="0"> In the experiments, CV-EM and CV-EM2 improved the EM method for both noun words and verb words. The effectiveness of these methods was large for noun words, but was small for verb words. We believe that the cause of this difference is the difficulty of unsupervised learning for verb WSD. In ideal estimation, the precision for noun words was boosted from 76.78% to 79.64% by the EM method, that is 1.037 times. On the other hand, the precision for verb words was boosted from 78.16% to 79.92% by the EM method, that is 1.022 times. This shows that the EM method does not work so well for verb words.</Paragraph>
      <Paragraph position="1"> We consider that feature independence plays a key role in unsupervised learning. Suppose the instance x consists of two features f  ). If it is right, unsupervised learning works well, but if it is not, unsupervised learning fails. Intuitively, feature independence warrants in- ). In noun WSD, the left context of the target word corresponds to the words modifying the target word, and the right context of the target word corresponds to the verb whose case slot can take the target word. The left context and the right context can each determine the meaning of the target word on their own, and they are independent of each other. They therefore act as independent features. On the other hand, we cannot find such a convenient interpretation for the features of verbs (Shinnou, 2002). Therefore, the EM method is not so effective for verb words.</Paragraph>
      <Paragraph position="2"> Naive Bayes assumes the independence of features, too. However, this assumption is not so rigid in practice. We believe that the improvement by the EM method for verb words depends on the robustness of Naive Bayes. In our experiments, the EM method for noun words failed to boost the precision. We think that the cause is the imbalance of labeled data, unlabeled data and test data. We should investigate this in a future study.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="5" end_page="5" type="metho">
    <SectionTitle>
6.5 Related works
</SectionTitle>
    <Paragraph position="0"> Co-training (Blum and Mitchell, 1998) is a powerful unsupervised learning method. In Co-training, if we can find two independent feature sets for the target problem, any supervised learning method can be used. Furthermore, it is reported that Co-training is superior to the EM method if completely independent feature sets can be used (Nigam and Ghani, 2000). However, Co-training requires consistency in addition to independence of the two feature sets. This condition makes it difficult to apply Co-training to multiclass classification problems. On the other hand, the EM method requires Naive Bayes to be used as the supervised learning method, but can be applied to multiclass classification problems without any modification. Therefore, the EM method is more practical than Co-training.</Paragraph>
    <Paragraph position="1"> Yarowsky proposed an unsupervised learning method for WSD (Yarowsky, 1995). His method is reported to be a special case of Co-training (Blum and Mitchell, 1998).</Paragraph>
    <Paragraph position="2"> Of its two independent feature sets, one is the context surrounding the target word and the other is the heuristic of 'one sense per discourse'. However, it is unknown how valid this heuristic is at the granularity of meanings of our evaluation words. Furthermore, this method needs, as unlabeled data, documents in which the target word appears multiple times. Therefore, it is not so easy to gather unlabeled data. On the other hand, the EM method does not have such a problem, because it uses sentences including the target word as unlabeled data.</Paragraph>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.6 Future works
</SectionTitle>
      <Paragraph position="0"> We have three directions for future work. First, we must raise the precision for verb words, which may be impossible unless we use other features, so we need to investigate other features. Second, we must improve the method of estimating the optimum iteration number in the EM algorithm. The difference between the precision obtained through our estimation and that obtained through the ideal estimation is still large, so a better estimation method would improve the accuracy.</Paragraph>
      <Paragraph position="1"> Finally, we will investigate the reason for the failure of the EM method, which may be the key to unsupervised learning.</Paragraph>
    </Section>
  </Section>
</Paper>