<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1132"> <Title>Learning a Robust Word Sense Disambiguation Model using Hypernyms in Definition Sentences</Title>
<Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 SVM Classifier </SectionTitle>
<Paragraph position="0"> The first classifier is the SVM classifier. Since SVM is a supervised learning algorithm, a word sense-tagged corpus is required as training data, and the classifier cannot be used to disambiguate words which do not occur frequently in that data. However, as the effectiveness of SVM has been widely reported for a variety of NLP tasks including WSD (Murata et al., 2001; Takamura et al., 2001), we expect it to work well for the disambiguation of high frequency words.</Paragraph>
<Paragraph position="1"> When training the SVM classifier, each training instance is represented by a feature vector. We used the following features, which are typical for WSD.</Paragraph>
<Paragraph position="2"> Surface forms of a target word and of the words just before or after it. A number in parentheses indicates the position of a word relative to the target word.</Paragraph>
<Paragraph position="3"> Semantic classes of content words in a sentence. The semantic classes used here are derived from the Japanese thesaurus "Nihongo-Goi-Taikei" (Ikehara et al., 1997).</Paragraph>
<Paragraph position="4"> A case filler noun (B_noun) and its head verb (B_verb) when the target word is a case filler noun of a certain verb.</Paragraph>
<Paragraph position="5"> We used the LIBSVM package for training the SVM classifier. The SVM model is ν-SVM (Schölkopf, 2000) with a linear kernel, where the parameter ν = 0.0001. The pairwise (one-vs-one) method is used to apply SVM to multi-class classification.</Paragraph>
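<Paragraph position="6"> As a concrete illustration of this setup, the following is a minimal sketch (not the authors' code) of a ν-SVM word sense classifier with a linear kernel and ν = 0.0001, built on scikit-learn's LIBSVM wrapper, whose SVM classes use the pairwise (one-vs-one) scheme for multi-class problems. Only the surface-form features are shown; the feature names and the toy English data are illustrative assumptions.</Paragraph>
```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import NuSVC

def extract_features(tokens, target_index):
    """Surface-form features in the spirit of Section 2: the target
    word w(0) and the words just before/after it, the number in
    parentheses giving the position relative to the target."""
    feats = {"w(0)": tokens[target_index]}
    if target_index > 0:
        feats["w(-1)"] = tokens[target_index - 1]
    if target_index + 1 < len(tokens):
        feats["w(+1)"] = tokens[target_index + 1]
    return feats

# Toy sense-tagged instances: (tokenized sentence, target position, sense ID).
training = [
    (["he", "sat", "on", "the", "bank"], 4, "bank/riverside"),
    (["she", "robbed", "the", "bank", "yesterday"], 3, "bank/institution"),
    (["the", "bank", "approved", "the", "loan"], 1, "bank/institution"),
    (["fishing", "from", "the", "bank", "today"], 3, "bank/riverside"),
]

vec = DictVectorizer()
X = vec.fit_transform([extract_features(t, i) for t, i, _ in training])
y = [sense for _, _, sense in training]

# nu-SVM with a linear kernel and nu = 0.0001, as in the paper;
# scikit-learn's SVM classes apply the one-vs-one (pairwise) scheme
# for multi-class classification.
clf = NuSVC(nu=0.0001, kernel="linear")
clf.fit(X, y)

tokens, i = ["money", "in", "the", "bank", "account"], 3
print(clf.predict(vec.transform([extract_features(tokens, i)])))
```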
</Section>
<Section position="4" start_page="2" end_page="3" type="metho"> <SectionTitle> 3 Naive Bayes Classifier using Hypernyms in Definition Sentences </SectionTitle>
<Paragraph position="0"> In this section, we will describe the details of the WSD classifier that uses hypernyms of words in definition sentences. (We tried using the special symbol "NUM" as a feature for any numbers in a sentence, but the performance was slightly worse in our experiment. We thank the anonymous reviewer who gave us this comment.)</Paragraph>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Overview </SectionTitle>
<Paragraph position="0"> Let us explain the basic idea of the model by considering the case in which the word "mandan" (comic chat) in the following example sentence (A) should be disambiguated: (A) ... (Mr. Sakano was initiated into the world of comic chat ...)</Paragraph>
<Paragraph position="1"> In this paper, word senses are defined according to the EDR concept dictionary (EDR, 1995). Figure 1 illustrates two meanings of "mandan" (comic chat) in the EDR concept dictionary. "CID" indicates a concept ID, the identification number of a sense.</Paragraph>
<Paragraph position="2"> One way to disambiguate the senses of "mandan" (comic chat) is to train the WSD classifier from the sense-tagged corpus, as with the SVM classifier. However, when "mandan" (comic chat) occurs infrequently or not at all in the training corpus, we cannot train any reliable classifier.</Paragraph>
<Paragraph position="3"> To train the WSD classifier for low frequency words, we look at hypernyms of senses in definition sentences. In Japanese, the last word in a definition sentence is in most cases a hypernym. For example, the hypernym of sense 3c5631 in Figure 1 is the last underlined word "engei" (entertainment), while the hypernym of 1f66e3 is "hanashi" (story). In the EDR concept dictionary, there are other senses whose hypernyms are also "engei" (entertainment) or "hanashi" (story). For example, as shown in Figure 2, 10d9a4, 3c3fbb and 3c5ab3 are senses whose hypernym is "engei" (entertainment), while the hypernym of 3cf737, 0f73c1 and 3c3071 is "hanashi" (story). If these senses occur in the training corpus, we can train a classifier that determines whether the hypernym of "mandan" (comic chat) is "engei" (entertainment) or "hanashi" (story). If we can determine the correct hypernym, we can also determine which is the correct sense, 3c5631 or 1f66e3. Notice that we can train such a model even when "mandan" (comic chat) itself does not occur in the training corpus.</Paragraph>
<Paragraph position="4"> As described later, we train a probabilistic model that predicts the hypernym of a given word, instead of a word sense. Much more training data is available for the model predicting hypernyms than for the model predicting senses, because there are fewer types of hypernyms than of senses. Figure 2 illustrates this fact clearly: all words labeled with 10d9a4, 3c3fbb or 3c5ab3 in the training data can be used as data labeled with the hypernym "engei" (entertainment). In this way, we can train a reliable WSD classifier for low frequency words. Furthermore, hypernyms can be automatically extracted from definition sentences, as described in Subsection 3.2, so that the model can be trained without human intervention.</Paragraph>
</Section>
<Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 3.2 Extraction of Hypernyms </SectionTitle>
<Paragraph position="0"> In this subsection, we will describe how to extract hypernyms from definition sentences in a dictionary. In principle, we assume that the hypernym of a sense is the last word of its definition sentence. For example, in the definition sentence of sense 3c5631 of "mandan" (comic chat), the last word "engei" (entertainment) is the hypernym, as shown in Figure 1. However, we cannot always regard the last word as a hypernym. Consider the definition of sense 0eb70d of the word "gēmu" (game). In the EDR concept dictionary, the expression "A wo arawasu go" (a word representing A) often appears in definition sentences. In this case, the hypernym of the sense is not the last word but A. Thus the hypernym of 0efb60 is not the last word "go" (word) but "kaisuu" (number) in Figure 3.</Paragraph>
<Paragraph position="1"> When we extract a hypernym from a definition sentence, the definition sentence is first analyzed morphologically (word segmentation and POS tagging) by ChaSen, a Japanese morphological analyzer (http://chasen.aist-nara.ac.jp/hiki/ChaSen/). Then a hypernym in the definition sentence is identified by pattern matching. An example of the patterns used here is the rule that extracts A when the expression "A wo arawasu go" is found in a definition sentence. We made 64 such patterns manually in order to extract hypernyms appropriately.</Paragraph>
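<Paragraph position="2"> A minimal sketch of this two-step extraction heuristic is shown below; the regular expressions and the English sample definitions are invented stand-ins for the 64 Japanese patterns and for morphologically analyzed EDR definitions.</Paragraph>
```python
import re

# Illustrative exception patterns standing in for the 64 manual
# Japanese patterns of Subsection 3.2; the English phrasing is an
# invented stand-in for expressions such as "A wo arawasu go"
# (a word representing A).
EXCEPTION_PATTERNS = [
    re.compile(r"a word representing (?P<hyper>\w+)"),  # hypernym is A, not the last word
    re.compile(r"a kind of (?P<hyper>\w+)"),
]

def extract_hypernym(definition_tokens):
    """Return the hypernym of a sense from its segmented definition
    sentence: the match of an exception pattern if one applies,
    otherwise the last word, as assumed for Japanese definitions."""
    text = " ".join(definition_tokens)
    for pattern in EXCEPTION_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group("hyper")
    return definition_tokens[-1]  # default rule: the last word

# Default rule: the last word is taken as the hypernym.
print(extract_hypernym("a short comic performance classed as entertainment".split()))
# Exception rule: "a word representing A" yields A instead of the last word.
print(extract_hypernym("a word representing number".split()))
```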
<Paragraph position="3"> Out of the 194,303 senses of content words in the EDR concept dictionary, hypernyms were extracted for 191,742 senses (98.7%) by our pattern matching algorithm. Furthermore, we chose 100 hypernyms randomly and checked their validity, and found that 96% of them were appropriate. Therefore, our method for extracting hypernyms worked well. The major reasons why acquisition of hypernyms failed were a lack of patterns and errors in the morphological analysis of definition sentences.</Paragraph>
</Section>
<Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.3 Naive Bayes Model </SectionTitle>
<Paragraph position="0"> We will now describe the details of our probabilistic model that considers hypernyms in definition sentences. First of all, let us consider the following probability: P(s, c | F) (1) where s is a sense of a target word, c is the hypernym extracted from the definition sentence of s, and F is the set of features representing an input sentence including the target word. Next, we approximate Equation (1) as (2): P(s, c | F) = P(s | c, F) P(c | F) ≈ P(s | c) P(c | F) (2) The first term, P(s | c, F), is the probabilistic model that predicts a sense s given a feature set F (and c). It is similar to the ordinary Naive Bayes model for WSD (Pedersen, 2000). However, we assume that this model cannot be trained for low frequency words due to a lack of training data. Therefore, we approximate P(s | c, F) by P(s | c). Using Bayes' rule, Equation (2) can be rewritten as follows: P(s | c) P(c | F) = (P(c | s) P(s) / P(c)) × (P(F | c) P(c) / P(F)) = P(c | s) P(s) P(F | c) / P(F) (3) Notice that P(c | s) in (3) is equal to 1, because the hypernym c of a sense s is uniquely extracted by pattern matching (Subsection 3.2). As all we want to do is to choose the s that maximizes this probability, the constant factor P(F) can be ignored: s* = argmax_s P(s) P(F | c) (6) Finally, by the Naive Bayes assumption, that is, that all features in F are conditionally independent given c, Equation (6) can be approximated as follows: s* = argmax_s P(s) ∏_i P(f_i | c) (7) In (7), P(s) is the prior probability of a sense s, which reflects statistics of the appearance of senses, while P(f_i | c) is the posterior probability, which reflects collocation statistics between an individual feature f_i and a hypernym c. The parameters of these probabilistic models can be estimated from the word sense-tagged corpus; we estimated P(s) by Expected Likelihood Estimation, and P(f_i | c) from the same corpus. The features used here are almost the same as the ones used in the SVM classifier, except for a few features that were excluded from the Naive Bayes model: according to a preliminary experiment, the accuracy of the Naive Bayes model slightly decreased when all the features of the SVM classifier were used, and this was the reason why we did not use those features.</Paragraph>
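<Paragraph position="1"> The decision rule (7) is small enough to sketch directly. In the following sketch, the sense IDs and hypernyms follow the examples of Figures 1 and 2, while the toy features, counts, and add-one smoothing are invented stand-ins for the paper's corpus statistics and Expected Likelihood Estimation.</Paragraph>
```python
import math
from collections import Counter, defaultdict

# Toy sense inventory: sense ID -> hypernym extracted from its
# definition, i.e. c = hyper(s).
HYPER = {
    "mandan/3c5631": "engei",          # comic chat as entertainment
    "mandan/1f66e3": "hanashi",        # comic chat as story
    "rakugo/10d9a4": "engei",
    "mukashibanashi/3cf737": "hanashi",
}

sense_counts = Counter()                 # counts for the prior P(s)
feat_counts = defaultdict(Counter)       # feat_counts[c][f] for P(f | c)
hyper_totals = Counter()                 # total feature count per hypernym c

def train(instances):
    """instances: (sense_id, feature_list) pairs from a sense-tagged corpus."""
    for sense, feats in instances:
        sense_counts[sense] += 1
        c = HYPER[sense]
        for f in feats:
            feat_counts[c][f] += 1
            hyper_totals[c] += 1

def disambiguate(candidate_senses, feats, vocab=10000):
    """Equation (7): argmax_s log P(s) + sum_i log P(f_i | hyper(s))."""
    n = sum(sense_counts.values())
    def score(sense):
        c = HYPER[sense]
        logp = math.log((sense_counts[sense] + 0.5) / (n + 0.5 * len(HYPER)))
        for f in feats:
            logp += math.log((feat_counts[c][f] + 1) / (hyper_totals[c] + vocab))
        return logp
    return max(candidate_senses, key=score)

train([("rakugo/10d9a4", ["stage", "audience"]),
       ("mukashibanashi/3cf737", ["told", "children"])])
print(disambiguate(["mandan/3c5631", "mandan/1f66e3"], ["stage", "audience"]))
```
<Paragraph position="2"> Note that "mandan" itself never occurs in the toy training data; its senses are scored entirely through the statistics of other senses that share the hypernyms "engei" and "hanashi", which is exactly the point of the model.</Paragraph>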
</Section>
<Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.4 Discussion </SectionTitle>
<Paragraph position="0"> The following discussion examines our method for extracting hypernyms from definition sentences.</Paragraph>
</Section>
<Section position="5" start_page="3" end_page="3" type="sub_section"> <SectionTitle> Multiple Hypernyms </SectionTitle>
<Paragraph position="0"> In general, two or more hypernyms can be extracted from a definition, when the definition of a sense consists of several sentences or a definition sentence contains a coordinate structure. However, for this work we extracted only one hypernym per sense, because the definitions of all senses in the EDR concept dictionary consist of a single sentence, and most of them contain no coordinate structure. In order to apply our model to multiple hypernyms, we would have to consider the probabilistic model P(s, C | F) instead of Equation (1), where C is a set of hypernyms. Unfortunately, the estimation of P(s, C | F) is not obvious, so its investigation is left for future work.</Paragraph>
<Paragraph position="1"> Ambiguity of hypernyms: The fact that hypernyms may themselves have several meanings does not appear to be a major problem because, according to our rough observation, most hypernyms in the definition sentences of a given dictionary have a single meaning. So for this work we ignored the possible ambiguity of hypernyms.</Paragraph>
<Paragraph position="2"> Using other dictionaries: As described in Subsection 3.2, hypernyms are extracted by pattern matching. We would have to rebuild these patterns to use other dictionaries, but we do not expect this to require too much labor. Generally, in Japanese the last word of a definition sentence can be regarded as a hypernym, and many of the extraction patterns built for the EDR concept dictionary may also be applicable to other dictionaries. We are already building patterns to extract hypernyms from another major Japanese dictionary, the Iwanami Kokugo Jiten, and developing a WSD system that will use them.</Paragraph>
</Section> </Section>
<Section position="5" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Combined Model </SectionTitle>
<Paragraph position="0"> The previous two sections described the details of two WSD classifiers: the SVM classifier for high frequency words and the Naive Bayes classifier for low frequency words. These two classifiers are combined to construct a robust WSD system. We developed two kinds of combined models, described below in Subsections 4.1 and 4.2.</Paragraph>
<Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.1 Simple Ensemble </SectionTitle>
<Paragraph position="0"> In this model, the process combining the two classifiers is quite simple. When only one of the classifiers, SVM or Naive Bayes, outputs senses for a given word, the combined model outputs the senses provided by that classifier. When both classifiers output senses, the ones provided by the SVM classifier are always chosen as the final output. In the experiment in Section 5, SVM classifiers were trained for words which occur more than 20 times in the training corpus. Therefore, the simple ensemble described here can be summarized as follows: we use the SVM classifier for high frequency words (those which occur more than 20 times) and the Naive Bayes classifier for low frequency words, as in the sketch below.</Paragraph>
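<Paragraph position="1"> The combination rule is a one-liner; the following sketch assumes each classifier maps a target word and its sentence to a (possibly empty) list of sense IDs, with the SVM classifier returning an empty list for the low frequency words it has no model for.</Paragraph>
```python
# A minimal sketch of the simple ensemble of Subsection 4.1, under the
# assumed interface described above.
def simple_ensemble(word, sentence, svm_classify, nb_classify):
    """Always prefer the SVM output when it exists; otherwise fall
    back to the hypernym-based Naive Bayes classifier."""
    svm_senses = svm_classify(word, sentence)  # [] if no SVM model for word
    if svm_senses:
        return svm_senses
    return nb_classify(word, sentence)
```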
</Section>
<Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Ensemble using Validation Data </SectionTitle>
<Paragraph position="0"> First, we prepare validation data, a sense-tagged corpus used as common test data for the two classifiers. The performance of the classifiers for a word w is evaluated by the correctness C_w: C_w = (# of words in which one of the senses selected by a classifier is correct) / (# of words for which the classifier selects one or more senses) (8) The main reason for combining the two classifiers is to improve the recall and applicability of the WSD system. Note that a classifier which often outputs a correct sense will achieve a high correctness C_w even if it also outputs wrong senses. Thus, the higher the C_w of a classifier, the more it improves the recall of the combined model.</Paragraph>
<Paragraph position="1"> Next, the correctness C_w of each classifier for each word w is measured on the validation data. When both classifiers output senses for a given word, their C_w scores are compared, and the word senses provided by the better classifier are selected as the final output. When the number of occurrences of a word in the validation data is small, the comparison of the classifiers' C_w is unreliable. For that reason, when a word occurs in the validation data fewer times than a certain threshold θ_h, the senses output by the SVM classifier are chosen as the final output; this is because the correctness over all words in the validation data is higher for the SVM classifier than for the Naive Bayes classifier. In the experiment in Section 5, we set θ_h to 10.</Paragraph>
</Section> </Section>
<Section position="6" start_page="3" end_page="10" type="metho"> <SectionTitle> 5 Experiment </SectionTitle>
<Paragraph position="0"> In this section, we will describe the experiment conducted to evaluate our proposed method. We used the EDR corpus (EDR, 1995), which is made up of about 200,000 Japanese sentences extracted from newspaper articles and magazines; each word in the corpus is annotated with a sense ID (CID). We used 20,000 sentences of the EDR corpus as the test data, 20,000 sentences as the validation data, and the remaining 161,332 sentences as the training data. The training data was used to train the SVM and Naive Bayes classifiers, while the validation data was used for the combined model described in Subsection 4.2. The target instances used for evaluation were all ambiguous content words in the test data; the number of target instances was 91,986.</Paragraph>
<Paragraph position="1"> We evaluated three single WSD classifiers and two combined models:
* BL: the baseline model, a WSD classifier which always selects the most frequently used sense. When there is more than one sense with equally high frequency, the classifier chooses all those senses.
* SVM: the SVM classifier of Section 2.
* NB: the Naive Bayes classifier of Section 3.
* SVM+NB(simple): the simple ensemble of Subsection 4.1.
* SVM+NB(validation): the ensemble using validation data of Subsection 4.2.</Paragraph>
<Paragraph position="2"> Table 1 shows the results on the test data, evaluated by precision (P), recall (R), and F-measure F = 2PR / (P + R), where P and R represent the precision and recall, respectively. A (applicability) indicates the ratio of the number of instances disambiguated by a classifier to the total number of target instances; T indicates the number of word types which could be disambiguated by a classifier.</Paragraph>
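<Paragraph position="3"> As a concrete illustration of these measures, the following sketch computes the per-word correctness C_w of Equation (8) and the applicability A, assuming each evaluation record is a (target word, selected senses, gold sense) triple with an empty selection meaning the classifier abstained.</Paragraph>
```python
from collections import defaultdict

def correctness_per_word(records):
    """C_w of Equation (8): instances of w where one selected sense is
    correct, divided by instances of w with at least one selection."""
    answered, correct = defaultdict(int), defaultdict(int)
    for word, output, gold in records:
        if output:
            answered[word] += 1
            if gold in output:
                correct[word] += 1
    return {w: correct[w] / answered[w] for w in answered}

def applicability(records):
    """A: disambiguated instances over all target instances."""
    return sum(1 for _, output, _ in records if output) / len(records)

records = [
    ("mandan", ["3c5631"], "3c5631"),
    ("mandan", ["1f66e3", "3c5631"], "3c5631"),  # correct: one output matches
    ("game", [], "0eb70d"),                      # abstained
]
print(correctness_per_word(records))  # {'mandan': 1.0}
print(applicability(records))         # 0.666...
```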
<Paragraph position="4"> The two combined models outperformed the SVM classifier on all criteria except precision; the gains in recall and applicability were especially remarkable. Notice the figures in column "T" of Table 1: the SVM classifiers could be applied to only 4,575 words, while the Naive Bayes classifiers were applicable to 10,501 words, including low frequency words. Thus, the ensemble of these two classifiers significantly improves applicability and recall with little loss of precision.</Paragraph>
<Paragraph position="5"> Comparing the performance of the two combined models, "SVM+NB(validation)" slightly outperformed "SVM+NB(simple)", but there was no significant difference between them. The correctness C_w of the SVM classifier on the validation data was usually greater than that of the Naive Bayes classifier, so the SVM classifier was preferred whenever both were applicable. This is almost the same strategy as the simple ensemble, which we believe is why the performance of the two combined models was almost the same. In the rest of this section, we will show the results for the combined model using the validation data only.</Paragraph>
<Paragraph position="6"> Our goal was to improve the robustness of the WSD system. A naive way to construct a robust WSD system is to create an ensemble of a supervised classifier and a baseline classifier, so we compared our proposed method (SVM+NB) with the combined model of the SVM and baseline classifiers (SVM+BL). The results are shown in Table 2 and Figure 4. Table 2 reports the same criteria as Table 1, indicating that "SVM+NB" outperformed "SVM+BL" on all criteria. Figure 4 shows the relation between the F-measure of the classifiers and word frequency in the training data. The horizontal axis indicates the number of occurrences of a word in the training data (o) on a log scale. Squares and triangles with lines indicate the F(o) of "SVM+NB" and "SVM+BL", respectively, where F(o) is the macro average of the F-measures for words which occur o times in the training data. The broken line indicates N(o), the number of word types which occur o times in the training data. (To be more precise, F(o) and N(o) are the figures for words which occurred at least o times and fewer than o + t times, where o + t is the next point on the horizontal axis; t was chosen as the smallest integer such that N(o) is more than 100.) As shown in Figure 4, "SVM+NB" significantly outperformed "SVM+BL" for low frequency words, and the number of word types N(o) becomes markedly greater as o gets small. In other words, the Naive Bayes classifier proposed here can handle many more low frequency words than the baseline classifier. Therefore, to improve the robustness of the overall WSD system, it is more effective to combine the SVM classifier with the Naive Bayes classifier than with the baseline classifier.</Paragraph>
<Paragraph position="7"> Finally, we constructed a combined model of all three classifiers: the SVM, Naive Bayes, and baseline classifiers. As shown in Table 3, this model slightly outperformed the two-classifier combined models shown in Table 2.</Paragraph>
</Section> </Paper>