<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1039"> <Title>Relieving The Data Acquisition Bottleneck In Word Sense Disambiguation</Title> <Section position="4" start_page="0" end_page="0" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> To our knowledge, the earliest study of bootstrapping a WSD system with noisy data is by Gale et al. (Gale et al., 1992).</Paragraph> <Paragraph position="1"> Their investigation was limited in scale to six data items with two senses each and a bounded number of examples per test item.</Paragraph> <Paragraph position="2"> Two more recent investigations are by Yarowsky (Yarowsky, 1995) and, later, Mihalcea (Mihalcea, 2002). Each study, in turn, addresses the issue of data quantity while maintaining good-quality training examples. Both investigations present algorithms for bootstrapping supervised WSD systems using clean data based on a dictionary or an ontological resource. The general idea is to start with a clean initial seed and iteratively increase the seed size to cover more data.</Paragraph> <Paragraph position="3"> Yarowsky starts with a few tagged instances to train a decision list classifier. The initial seed is manually tagged with the correct senses based on entries in Roget's Thesaurus. The approach yields very successful results (95%) on a handful of data items.</Paragraph> <Paragraph position="4"> Mihalcea, on the other hand, bases the bootstrapping approach on a generation algorithm, GenCor (Mihalcea &amp; Moldovan, 1999). GenCor creates seeds from monosemous words in WordNet, SemCor data, sense-tagged examples from the glosses of polysemous words in WordNet, and other hand-tagged data if available. This initial seed set is used to query the Web for more examples, and the retrieved contexts are added to the seed corpus. The words in the retrieved contexts of the seed words are then disambiguated. 
The disambiguated contexts are then used to query the Web for yet more examples, and so on. It is an iterative algorithm that incrementally generates large amounts of sense-tagged data. The words found are restricted to those that are either part of noun compounds or internal arguments of verbs. Mihalcea's supervised learning system is an instance-based learning algorithm. In the study, Mihalcea compares the results of the supervised learning system trained on the automatically generated data, GenCor, against those of the same system trained on manually annotated data. She reports successful results on six of the data items tested.</Paragraph> </Section> </Paper>