<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1143">
<Title>Simple Features for Chinese Word Sense Disambiguation</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Word Sense Disambiguation (WSD) is a central open problem at the lexical level of Natural Language Processing (NLP). Highly ambiguous words pose continuing problems for NLP applications: they can lead to irrelevant document retrieval in Information Retrieval systems and to inaccurate translations in Machine Translation systems (Palmer et al., 2000). For example, the Chinese word (jian4) has many different senses, one of which can be translated into English as "see" and another as "show". Correctly sense-tagging the Chinese word in context can be highly beneficial for lexical choice in Chinese-English machine translation.</Paragraph>
<Paragraph position="1"> Several efforts have been made to develop automatic WSD systems that can provide accurate sense tagging (Ide and Veronis, 1998), with a current emphasis on creating manually sense-tagged data for supervised training of statistical WSD systems, as evidenced by SENSEVAL-1 (Kilgarriff and Palmer, 2000) and SENSEVAL-2 (Edmonds and Cotton, 2001). Highly polysemous verbs, which have several distinct but related senses, pose the greatest challenge for these systems (Palmer et al., 2001). Predicate-argument information and selectional restrictions are hypothesized to be particularly useful for disambiguating verb senses.</Paragraph>
<Paragraph position="2"> Maximum entropy models can be used to solve any classification task and have been applied to a wide range of NLP tasks, including sentence boundary detection, part-of-speech tagging, and parsing (Ratnaparkhi, 1998). Assigning sense tags to words in context can be viewed as a classification task similar to part-of-speech tagging, except that a separate set of tags is required for each vocabulary item to be sense-tagged. Under the maximum entropy framework (Berger et al., 1996), evidence from different features can be combined with no assumptions of feature independence. The automatic tagger estimates the conditional probability that a word has sense s given that it occurs in context c, where c is a conjunction of features. The estimated probability is derived from feature weights that are determined automatically from the training data so as to produce the probability distribution with maximum entropy, under the constraint that it be consistent with the observed evidence. With existing tools for learning maximum entropy models, the bulk of our work lies in defining the types of features to look for in the data. Our goal is to see whether sense-tagging of verbs can be improved by combining linguistic features that capture information about predicate-arguments and selectional restrictions.</Paragraph>
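For reference, the conditional maximum entropy model of Berger et al. (1996) takes the log-linear form sketched below; the symbols s (a sense), c (a context), f_i (binary features over context-sense pairs), lambda_i (feature weights), and Z(c) (the normalizing factor) are our own notation for clarity, not taken verbatim from the paper:

p(s \mid c) = \frac{1}{Z(c)} \exp\Big( \sum_i \lambda_i f_i(c, s) \Big), \qquad Z(c) = \sum_{s'} \exp\Big( \sum_i \lambda_i f_i(c, s') \Big)

The weights lambda_i are estimated from the sense-tagged training data so that each feature's expected value under the model matches its empirical value, which yields the maximum entropy distribution consistent with the observed evidence.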
<Paragraph position="3"> In this paper we report on our experiments on automatic WSD using a maximum entropy approach for both English and Chinese verbs. We compare the difficulty of the sense-tagging tasks in the two languages and investigate the types of contextual features that are useful for each language. We find that while richer linguistic features are useful for English WSD, they do not prove to be as beneficial for Chinese.</Paragraph>
<Paragraph position="4"> The maximum entropy system performed competitively with the best systems on the English verbs in SENSEVAL-1 and SENSEVAL-2 (Dang and Palmer, 2002). However, while SENSEVAL-2 made it possible to compare many different approaches across many different languages, data for the Chinese lexical sample task was not made available in time for any systems to compete. Instead, we report on two experiments that we ran using our own lexicon and two separate Chinese corpora that are very similar in style (news articles from the People's Republic of China) but have different types and levels of annotation: the Penn Chinese Treebank (CTB) (Xia et al., 2000) and the People's Daily News (PDN) corpus from Beijing University. We discuss the utility of different types of annotation for successful automatic word sense disambiguation.</Paragraph>
</Section>
</Paper>