<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1143"> <Title>Simple Features for Chinese Word Sense Disambiguation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 English Experiment </SectionTitle> <Paragraph position="0"> Our maximum entropy WSD system was designed to combine information from many different sources, using as much linguistic knowledge as could be gathered automatically by current NLP tools. In order to extract the linguistic features necessary for the model, all sentences were first automatically part-of-speech-tagged using a maximum entropy tagger (Ratnaparkhi, 1998) and parsed using the Collins parser (Collins, 1997). In addition, an automatic named entity tagger (Bikel et al., 1997) was run on the sentences to map proper nouns to a small set of semantic classes.</Paragraph> <Paragraph position="1"> Chodorow, Leacock and Miller (Chodorow et al., 2000) found that different combinations of topical and local features were most effective for disambiguating different words. Following their work, we divided the possible model features into topical features and several types of local contextual features. Topical features looked for the presence of key-words occurring anywhere in the sentence and any surrounding sentences provided as context (usually one or two sentences). The set of 200-300 keywords is specific to each lemma to be disambiguated, and is determined automatically from training data so as to minimize the entropy of the probability of the senses conditioned on the keyword.</Paragraph> <Paragraph position="2"> The local features for a verb a4 in a particular sentence tend to look only within the smallest clause containing a4 . They include collocational features requiring no linguistic preprocessing beyond part-of-speech tagging (1), syntactic features that capture relations between the verb and its complements (2-4), and semantic features that incorporate information about noun classes for subjects and objects (5-6): 1. the word a4 , the part of speech of a4 , the part of speech of words at positions -1 and +1 relative to a4 , and words at positions -2, -1, +1, +2, relative to a4 2. whether or not the sentence is passive 3. whether there is a subject, direct object, indirect object, or clausal complement (a complement whose node label is S in the parse tree) 4. the words (if any) in the positions of subject, direct object, indirect object, particle, prepositional complement (and its object) 5. a Named Entity tag (PERSON, ORGANIZA-TION, LOCATION) for proper nouns appearing in (4) 6. WordNet synsets and hypernyms for the nouns appearing in (4)</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 English Results </SectionTitle> <Paragraph position="0"> The maximum entropy system's performance on the verbs from the evaluation data for SENSEVAL-1 (Kilgarriff and Rosenzweig, 2000) rivaled that of the best-performing systems. We looked at the effect of adding topical features to local features that either included WordNet class features or used just lexical and named entity features. In addition, we experimented to see if performance could be improved by undoing passivization transformations to recover underlying subjects and objects. 
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 English Results </SectionTitle> <Paragraph position="0"> The maximum entropy system's performance on the verbs from the evaluation data for SENSEVAL-1 (Kilgarriff and Rosenzweig, 2000) rivaled that of the best-performing systems. We looked at the effect of adding topical features to local features that either included WordNet class features or used just lexical and named entity features. In addition, we experimented to see if performance could be improved by undoing passivization transformations to recover underlying subjects and objects. This was expected to increase the accuracy with which verb arguments could be identified, helping in cases where selectional restrictions on arguments played an important role in differentiating between senses.</Paragraph> <Paragraph position="1"> The best overall variant of the system for verbs did not use WordNet class features, but included topical keywords and the passivization transformation, giving an average verb accuracy of 72.3%. If only the best combination of feature sets for each verb is used, then the maximum entropy models achieve 73.7% accuracy. These results are not significantly different from the reported results of the best-performing systems (Yarowsky, 2000).</Paragraph> <Paragraph position="2"> Our system was competitive with the top-performing systems even though it used only the training data provided and none of the information from the dictionary to identify multi-word constructions.</Paragraph> <Paragraph position="3"> Later experiments show that the ability to correctly identify multi-word constructions improves performance substantially.</Paragraph> <Paragraph position="4"> We also tested the WSD system on the verbs from the English lexical sample task for SENSEVAL-2 (the verbs included draw, dress, drift, drive, face, ferret, find, keep, leave, live, match, play, pull, replace, see, serve, strike, train, treat, turn, use, wander, wash, and work). In contrast to SENSEVAL-1, senses involving multi-word constructions could be identified directly from the sense tags themselves, and the head word and satellites of multi-word constructions were explicitly marked in the training and test data. This additional annotation made it much easier to incorporate information about the satellites, without having to look at the dictionary (whose format may vary from one task to another). All the best-performing systems on the English verb lexical sample task filtered out possible senses based on the marked satellites, and this improved performance.</Paragraph> <Paragraph position="5"> Table 1 shows the performance of the system using different subsets of features. In general, adding features from richer linguistic sources tended to improve accuracy. Adding syntactic features to collocational features proved most beneficial in the absence of topical keywords, which could detect some of the complements and arguments that would normally be picked up by parsing (complementizers, prepositions, etc.). And while topical information did not always improve results significantly, syntactic features along with semantic class features always proved beneficial.</Paragraph> <Paragraph position="6"> Incorporating topical keywords as well as collocational, syntactic, and semantic local features, our system achieved 60.2% and 70.2% accuracy using fine-grained and coarse-grained scoring, respectively. This compares with the next best-performing system, which had fine- and coarse-grained scores of 57.6% and 67.2% (Palmer et al., 2001). If we had not included a filter that considered only phrasal senses whenever satellites of multi-word constructions were marked in the test data, our fine- and coarse-grained accuracy would have been reduced to 57.5% and 67.2% (significant at p < 0.05).</Paragraph>
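The satellite-based sense filter can be sketched as follows. The encoding below, mapping each candidate sense to the satellite words its multi-word construction requires, is hypothetical (the actual SENSEVAL-2 sense tags encode this differently), and the requirement that marked satellites match a phrasal sense's expected satellites is our own assumption; this illustrates only the filtering logic.

```python
def filter_senses(candidate_senses, satellites):
    """Restrict the candidate sense set using marked satellites.

    candidate_senses: dict mapping a sense tag to the tuple of satellite
    words its multi-word construction requires; an empty tuple marks a
    non-phrasal sense. This encoding is assumed for illustration.
    """
    if satellites:
        # Satellites are marked for this test instance: consider only
        # phrasal senses whose required satellites were observed.
        return [s for s, req in candidate_senses.items()
                if req and set(req) <= set(satellites)]
    # No satellites marked: phrasal senses are ruled out.
    return [s for s, req in candidate_senses.items() if not req]

# Hypothetical example for "pull", where one sense belongs to the
# multi-word construction "pull together":
senses = {"pull%1": (), "pull_together%2": ("together",)}
print(filter_senses(senses, ["together"]))  # ['pull_together%2']
print(filter_senses(senses, []))            # ['pull%1']
```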
</Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Chinese Experiments </SectionTitle> <Paragraph position="0"> We chose 28 Chinese words to be sense-tagged.</Paragraph> <Paragraph position="1"> Each word had multiple verb senses and possibly other senses for other parts of speech, with an average of 6 dictionary senses per word. The first 20 words were chosen by randomly selecting several files totaling 5000 words from the 100K-word Penn Chinese Treebank (CTB), and choosing only those words that had more than one dictionary verb sense and that occurred more than three times in these files. The remaining 8 words were chosen by selecting all words that had more than one dictionary verb sense and that occurred more than 25 times in the CTB. The definitions for the words were based on the CETA (Chinese-English Translation Assistance) dictionary (Group, 1982) and other hard-copy dictionaries. Figure 1 shows an example dictionary entry for the most common sense of jian4. For each word, a sense entry in the lexicon included the definition in Chinese as well as in English, the part of speech for the sense, a typical predicate-argument frame if the sense is for a verb, and an example sentence. With these definitions, each word was independently sense-tagged by two native Chinese-speaking annotators in a double-blind manner. Sense-tagging was done primarily using raw text, without segmentation, part-of-speech, or bracketing information. After finishing sense-tagging, the annotators met to compare and discuss their results, and to modify the definitions if necessary. The gold-standard sense-tagged files were then made after this discussion.</Paragraph> <Paragraph position="2"> In a manner similar to our English approach, we included topical features as well as collocational, syntactic, and semantic local features in the maximum entropy models. Collocational features could be extracted from data that had been segmented into words and tagged for part of speech:
- the target word
- the part-of-speech tag of the target word
- the words (if any) within 2 positions of the target word
- the part of speech of the words (if any) immediately preceding and following the target word
- whether the target word follows a verb
[Figure 1: Example sense definition for jian4.]
When disambiguating verbs, syntactic local features were extracted from data bracketed according to the Penn Chinese Treebank guidelines; these included whether the verb has a predicate complement (any phrase labeled with "-PRD"). Semantic features were generated by assigning a HowNet noun category to each subject and object, and topical keywords were extracted as for English. Once all the features were extracted, a maximum entropy model was trained and tested for each target word. We used 5-fold cross-validation to evaluate the system on each word. Two methods were used for partitioning a dataset of size n into five subsets: select n/5 consecutive occurrences for each subset, or select every 5th occurrence for a subset. In the end, the choice of partitioning method made little difference in overall performance, and we report accuracy as the precision using the latter (stratified) sampling method.</Paragraph>
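The collocational features in the list above are straightforward to extract once the text is segmented and tagged. A minimal sketch, assuming the sentence arrives as parallel lists of words and part-of-speech tags; the feature names and the "verb tags start with V" check (true of CTB tags such as VV, VA, VC, VE, but tagset-dependent) are our assumptions:

```python
def collocational_features(words, tags, i):
    """Collocational features from the list above, for the target word
    at index i of a segmented, part-of-speech-tagged sentence."""
    feats = {"word": words[i], "pos": tags[i]}
    # the words (if any) within 2 positions of the target word
    for off in (-2, -1, 1, 2):
        if 0 <= i + off < len(words):
            feats["word%+d" % off] = words[i + off]
    # the part of speech of the words (if any) immediately preceding
    # and following the target word
    for off in (-1, 1):
        if 0 <= i + off < len(tags):
            feats["pos%+d" % off] = tags[i + off]
    # whether the target word follows a verb
    feats["after_verb"] = i > 0 and tags[i - 1].startswith("V")
    return feats
```

The every-5th-occurrence partitioning described above then amounts to taking `occurrences[k::5]` as the k-th test fold.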
[Table 2: Performance of the system using different subsets of features for Penn Chinese Treebank words (manually segmented, part-of-speech-tagged, parsed).]
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Penn Chinese Treebank </SectionTitle> <Paragraph position="0"> All sentences containing any of the 28 target words were extracted from the Penn Chinese Treebank, yielding between 4 and 1143 occurrences (160 on average) for each of the target words. The manual segmentation, part-of-speech tags, and bracketing of the CTB were used to extract collocational and syntactic features.</Paragraph> <Paragraph position="1"> The overall accuracy of the system on the 28 words in the CTB was 94.4% using local collocational and syntactic features. This is significantly better than the baseline of 76.7% obtained by tagging all instances of a word with the most frequent sense of the word in the CTB. Considering only the 23 words for which more than one sense occurred in the CTB, overall system accuracy was 93.9%, compared with a baseline of 74.7%. Figure 2 shows the results broken down by word.</Paragraph>
[Figure 2: Accuracy and standard deviation for each word, using local collocational and syntactic features.]
<Paragraph position="2"> As with the English data, we experimented with different types of features. Table 2 shows the performance of the system using different subsets of features. While the system's accuracy using syntactic features was higher than using only collocational features, the improvement was not as substantial as for English, despite the fact that the Chinese bracketing was done manually and should be almost error-free.</Paragraph> <Paragraph position="3"> Semantic class information from HowNet yielded no improvement at all. To see if using a different ontology would help, we subsequently experimented with the ROCLing conceptual structures (Mo, 1992). In this case, we also manually added unknown nouns from the corpus to the ontology and labeled proper nouns with their conceptual structures, in order to more closely parallel the named entity information used in the English experiments. This resulted in a system accuracy of 95.0% (std. dev. 0.6), which again is not significantly better than omitting the noun class information.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 People's Daily News </SectionTitle> <Paragraph position="0"> Five of the CTB words (chu1, jian4, xiang3, hui1 fu4, yao4) had system performance of less than 80%, probably due to their low frequency in the CTB corpus. These words were subsequently sense-tagged in the People's Daily News (PDN), a much larger corpus (about one million words) that has manual segmentation and part-of-speech tagging, but no bracketing information. Those 5 words included all the words for which the system performed below the baseline in the CTB corpus. About 200 sentences for each word were selected randomly from PDN and sense-tagged as with the CTB.</Paragraph>
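To make the per-word training and evaluation loop concrete, here is a hedged sketch. The paper does not specify its maximum entropy implementation; scikit-learn's LogisticRegression (multinomial logistic regression, which is a maximum entropy classifier) stands in for it here, and the k-th test fold is taken as every 5th occurrence, mirroring the stratified partitioning described above. All names are ours.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def evaluate_word(instances, senses):
    """5-fold cross-validation accuracy for one target word.

    instances: list of feature dicts (e.g. from collocational_features)
    senses:    parallel list of gold sense tags
    Assumes at least two senses appear in each training split.
    """
    correct = total = 0
    for k in range(5):
        test_idx = set(range(k, len(instances), 5))
        train_X = [x for i, x in enumerate(instances) if i not in test_idx]
        train_y = [y for i, y in enumerate(senses) if i not in test_idx]
        test_X = [instances[i] for i in sorted(test_idx)]
        test_y = [senses[i] for i in sorted(test_idx)]
        vec = DictVectorizer()
        model = LogisticRegression(max_iter=1000)  # maxent stand-in
        model.fit(vec.fit_transform(train_X), train_y)
        predictions = model.predict(vec.transform(test_X))
        correct += sum(p == y for p, y in zip(predictions, test_y))
        total += len(test_y)
    return correct / total
```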
<Paragraph position="1"> We automatically annotated the PDN data to yield the same types of annotation that had been available in the CTB. We used a maximum-matching algorithm and a dictionary compiled from the CTB (Sproat et al., 1996; Xue, 2001) to do segmentation, and trained a maximum entropy part-of-speech tagger (Ratnaparkhi, 1998) and a TAG-based parser (Bikel and Chiang, 2000) on the CTB to do tagging and parsing. (On held-out portions of the CTB, the accuracy of the segmentation and part-of-speech tagging is over 95%, and the accuracy of the parsing is 82%, comparable to the performance of the English preprocessors; the performance of these preprocessors is naturally expected to degrade when transferred to a different domain.) Then the same feature extraction and model training was done for the PDN corpus as for the CTB.</Paragraph> <Paragraph position="2"> The system performance is much lower for the PDN than for the CTB, for several reasons. First, the PDN corpus is more balanced than the CTB, which contains primarily financial articles. A wider range of usages of the words was expressed in PDN than in CTB, making the disambiguation task more difficult; the average number of senses for the PDN words was 8.2 (compared to 3.5 for CTB), and the baseline accuracy was 58.0% (compared to 76.7% for CTB). Also, using automatically preprocessed data for the PDN introduced noise that was not present for the manually preprocessed CTB. Despite these differences between PDN and CTB, the trends in using increasingly richer linguistic preprocessing are similar. Table 3 shows that adding more features from richer levels of linguistic annotation yielded no significant improvement over using only collocational features. In fact, using only lexical collocations from automatic segmentation was sufficient to produce close to the best results. Table 4 shows the system performance using the available manual segmentation and part-of-speech tagging. While using part-of-speech tags seems to be better than using only lexical collocations, the difference is not significant.</Paragraph>
[Table 4: Performance of the system using different subsets of features for People's Daily News words (manually segmented, part-of-speech-tagged).]
</Section> </Section> </Paper>