<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0847">
  <Title>Optimizing Feature Set for Chinese Word Sense Disambiguation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Word sense disambiguation (WSD) is to assign appropriate meaning to a given ambiguous word in a text. Corpus based method is one of the successful lines of research on WSD. Many supervised learning algorithms have been applied for WSD, ex. Bayesian learning (Leacock et al., 1998), exemplar based learning (Ng and Lee, 1996), decision list (Yarowsky, 2000), neural network (Towel and Voorheest, 1998), maximum entropy method (Dang et al., 2002), etc.. In this paper, we employ Naive Bayes classifier to perform WSD.</Paragraph>
    <Paragraph position="1"> Resolving the ambiguity of words usually relies on the contexts of their occurrences. The feature set used for context representation consists of local and topical features. Local features include part of speech tags of words within local context, morphological information of target word, local collocations, and syntactic relations between contextual words and target word, etc.. Topical features are bag of words occurred within topical context. Contextual features play an important role in providing discrimination information for classifiers in WSD.</Paragraph>
    <Paragraph position="2"> In other words, an informative feature set will help classifiers to accurately disambiguate word senses, but an uninformative feature set will deteriorate the performance of classifiers. In this paper, we optimize feature set by maximizing the cross validated accuracy of Naive Bayes classifier on sense tagged training data.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Naive Bayes Classifier
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> value of fj, 1 * j * M, is 1 if fj is present in the context of target word, otherwise 0. In classification process, the Naive Bayes classifier tries to find the class that maximizes P(cijF), the probability of class ci given feature set F, 1 * i * L.</Paragraph>
    <Paragraph position="3"> Assuming the independence between features, the classification procedure can be formulated as:</Paragraph>
    <Paragraph position="5"> where p(ci), p(fjjci) and p(fj) are estimated using maximum likelihood method. To avoid the effects of zero counts when estimating p(fjjci), the zero counts of p(fjjci) are replaced with p(ci)=N, where N is the number of training examples.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Feature Set
</SectionTitle>
    <Paragraph position="0"> For Chinese WSD, there are two strategies to extract contextual information. One is based on Chinese characters, the other is to utilize Chinese words and related morphological or syntactic information. In our system, context representation is based on Chinese words, since words are less ambiguous than characters.</Paragraph>
    <Paragraph position="1"> We use two types of features for Chinese WSD: local features and topical features. All of these features are acquired from data at senseval3 without utilization of any other knowledge resource.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Local features
</SectionTitle>
      <Paragraph position="0"> Two sets of local features are investigated, which are represented by LocalA and LocalB. Let nl denote the local context window size.</Paragraph>
      <Paragraph position="1"> LocalA contains only part of speech tags with position information: POS!nl; ...,</Paragraph>
      <Paragraph position="3"> i-th words to the left (right) of target word w, and POS0 is the POS of w.</Paragraph>
      <Paragraph position="4"> Association for Computational Linguistics for the Semantic Analysis of Text, Barcelona, Spain, July 2004  SENSEVAL-3: Third International Workshop on the Evaluation of Systems LocalB enriches the local context by including the following features: local words with position information (W!nl, ..., W!1, W+1, ..., W+nl), bigram templates ((W!nl; W!(nl!1)), ..., (W!1; W+1), ..., (W+(nl!1); W+nl)), local words with POS tags (W POS) (position information is not considered), and part of speech tags with position information.  All of these POS tags, words, and bigrams are gathered and each of them contributed as one feature. For a training or test example, the value of some feature is 1 if it occurred in local context, otherwise it is 0. In this paper, we investigate two values of nl for LocalA and LocalB, 1 and 2, which results in four feature sets.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Topical features
</SectionTitle>
      <Paragraph position="0"> We consider all Chinese words within a context window size nt as topical features. For each training or test example, senseval3 data provides one sentence as the context of ambiguous word. In senseval3 Chinese training data, all contextual sentences are segmented into words and tagged with part of speech.</Paragraph>
      <Paragraph position="1"> Words which contain non-Chinese character are removed, and remaining words occurred within context window size nt are gathered. Each remaining word is considered as one feature. The value of topical feature is 1 if it occurred within window size nt, otherwise it is 0.</Paragraph>
      <Paragraph position="2"> In later experiment, we set different values for nt, ex. 1, 2, 3, 4, 5, 10, 20, 30, 40, 50. Our experimental result indicated that the accuracy of sense disambiguation is related to the value of nt. For different ambiguous words, the value of nt which yields best disambiguation accuracy is different. It is desirable to determine an optimal value, ^nt, for each ambiguous word by maximizing the cross validated accuracy.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Data Set
</SectionTitle>
    <Paragraph position="0"> In Chinese lexical sample task, training data consists of 793 sense-tagged examples for 20 ambiguous Chinese words. Test data consists of 380 untagged examples for the same 20 target words. Table 1 shows the details of training data and test data.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Criterion for Evaluation of Feature Sets
</SectionTitle>
    <Paragraph position="0"> In this paper, five fold cross validation method was employed to estimate the accuracy of our classifier, which was the criterion for evaluation of feature sets. All of the sense tagged examples of some target word in senseval3 training data were shuffled and divided into five equal folds. We used four folds as training set and the remaining fold as test set. This procedure was repeated five times under different division between training set and test set.</Paragraph>
    <Paragraph position="1"> The average accuracy over five runs is defined as the accuracy of our classifier.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Evaluation of Feature Sets
</SectionTitle>
    <Paragraph position="0"> Four feature sets were investigated: FEATUREA1: LocalA with nl = 1, and topical feature within optimal context window size ^nt; FEATUREA2: LocalA with nl = 2, and topical feature within optimal context window size ^nt; FEATUREB1: LocalB with nl = 1, and topical feature within optimal context window size ^nt; FEATUREB2: LocalB with nl = 2, and topical feature within optimal context window size ^nt.</Paragraph>
    <Paragraph position="1"> We performed training and test procedure using exactly same training and test set for each feature set. For each word, the optimal value of topical context window size ^nt was determined by selecting a minimal value of nt which maximized the cross validated accuracy.</Paragraph>
    <Paragraph position="2"> Table 2 summarizes the results of Naive Bayes classifier using four feature sets evaluated on senseval3 Chinese training data. Figure 1 shows the accuracy of Naive Bayes classifier as a function of topical context window size on four nouns and three verbs. Several results should be noted specifically: If overall accuracy over 20 Chinese characters is used as evaluation criterion for feature set, the four feature sets can be sorted as follows: FEATUREA1 &gt; FEATUREA2 ...</Paragraph>
    <Paragraph position="3"> FEATUREB1 &gt; FEATUREB2. This indicated that simply increasing local window size or enriching feature set by incorporating bigram templates, local word with position information, and local words with POS tags did not improve the performance of sense disambiguation.</Paragraph>
    <Paragraph position="4"> In table 2, it showed that with FEATUREA1, the optimal topical context window size was less than 10 words for 13 out of 20 target words. Figure 1 showed that for most of nouns and verbs, Naive Bayes classifier achieved best disambiguation accuracy with small topical context window size (&lt;10 words). This gives the evidence that for most of Chinese words, including nouns and verbs, the near distance context is more important than the long distance context for sense disambiguation.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Experimental Result
</SectionTitle>
    <Paragraph position="0"> The empirical study in section 6 showed that FEATUREA1 performed best among all the feature sets.</Paragraph>
    <Paragraph position="1"> A Naive Bayes classifier with FEATUREA1 as feature set was learned from all the senseval3 Chinese training data for each target word. Then we used  this classifier to determine the senses of occurrences of target words in test data. The official result of I2R!WSD system in Chinese lexical sample task is listed below: Precision: 60.40% (229.00 correct of 379.00 attempted). null Recall: 60.40% (229.00 correct of 379.00 in total). null Attempted: 100.00% (379.00 attempted of 379.00 in total).</Paragraph>
  </Section>
class="xml-element"></Paper>