<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1663">
<Title>Quality Assessment of Large Scale Knowledge Resources</Title>
<Section position="3" start_page="0" end_page="534" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Using large-scale semantic knowledge bases, such as WordNet (Fellbaum, 1998), has become a common, often necessary, practice for most current Natural Language Processing (NLP) systems. Even now, building knowledge bases that are large and rich enough for broad-coverage semantic processing takes a great deal of expensive manual effort by large research groups over long periods of development. This has severely hampered the state of the art of current NLP applications. For example, dozens of person-years have been invested in the development of wordnets for various languages (Vossen, 1998), but the data in these resources do not seem rich enough to support advanced concept-based NLP applications directly. It seems that applications will not scale up to open domains without more detailed and richer general-purpose (and also domain-specific) linguistic knowledge built by automatic means.</Paragraph>
<Paragraph position="1"> For instance, over more than eight years of manual construction (from version 1.5 to version 2.0), WordNet grew from 103,445 to 204,074 semantic relations; that is, around twelve thousand new semantic relations per year. However, in recent years the research community has devised a large set of innovative processes and tools for the large-scale automatic acquisition of lexical knowledge from structured or unstructured corpora. Among others, we can mention eXtended WordNet (Mihalcea and Moldovan, 2001), large collections of semantic preferences acquired from SemCor (Agirre and Martinez, 2001; Agirre and Martinez, 2002) or from the British National Corpus (BNC) (McCarthy, 2001), and large-scale Topic Signatures for each synset acquired from the web (Agirre and de la Calle, 2004) or from the BNC (Cuadros et al., 2005).</Paragraph>
<Paragraph position="2"> Obviously, these semantic resources have been acquired with very different sets of methods, tools and corpora, resulting in different sets of new semantic relations between synsets. In fact, each resource differs in volume and accuracy. Although isolated evaluations have been performed by their developers in different experimental settings, to date no comparable evaluation has been carried out in a common and controlled framework.</Paragraph>
<Paragraph position="3"> This work tries to establish the relative quality of these semantic resources in a neutral environment. The quality of each large-scale knowledge resource is evaluated indirectly on a Word Sense Disambiguation (WSD) task. In particular, we use a well-defined WSD evaluation benchmark (the Senseval-3 English Lexical Sample task) to assess the quality of each resource.</Paragraph>
<Paragraph position="4"> Furthermore, this work studies how these resources complement each other, that is, to what extent each knowledge base provides new knowledge not provided by the others.</Paragraph>
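As an illustration of the evaluation scheme and the complementarity question described above, the following is a minimal sketch, not the system used in this paper: it assumes each knowledge resource can be reduced to a bag of related words per candidate sense (as in a topic signature), disambiguates a lexical-sample instance by raw lexical overlap with its context, and quantifies complementarity as the fraction of relations one resource provides that another does not. All function names and the toy data are invented for illustration.

# Minimal sketch (illustrative only): plugging a knowledge resource into a
# simple overlap-based WSD heuristic and scoring it on a lexical-sample task.
from collections import Counter

def disambiguate(context_tokens, sense_signatures):
    """Return the sense whose word bag overlaps most with the context,
    or None if no sense overlaps at all (the system abstains)."""
    context = Counter(w.lower() for w in context_tokens)
    scores = {sense: sum(context[w.lower()] for w in words)
              for sense, words in sense_signatures.items()}
    best = max(scores, key=scores.get) if scores else None
    return best if best is not None and scores[best] > 0 else None

def evaluate(instances, sense_signatures, gold):
    """Precision over attempted instances and recall over all instances."""
    attempted = correct = 0
    for inst_id, context_tokens in instances.items():
        answer = disambiguate(context_tokens, sense_signatures)
        if answer is not None:
            attempted += 1
            correct += int(answer == gold[inst_id])
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(instances) if instances else 0.0
    return precision, recall

def novelty(resource_a, resource_b):
    """Fraction of relations in resource_a that resource_b does not provide:
    one simple way to quantify how two resources complement each other."""
    a, b = set(resource_a), set(resource_b)
    return len(a - b) / len(a) if a else 0.0

if __name__ == "__main__":
    # Toy lexical-sample data for the target word "bank" with two senses.
    signatures = {"bank#1": ["money", "loan", "account", "deposit"],
                  "bank#2": ["river", "water", "shore", "erosion"]}
    instances = {"bank.001": "she opened a savings account at the bank".split(),
                 "bank.002": "they fished from the muddy bank of the river".split()}
    gold = {"bank.001": "bank#1", "bank.002": "bank#2"}
    print(evaluate(instances, signatures, gold))   # expected: (1.0, 1.0)

    # Toy relation sets (synset pairs) from two hypothetical resources.
    res_a = {("car", "wheel"), ("car", "engine"), ("bank", "money")}
    res_b = {("car", "engine"), ("bank", "river")}
    print(novelty(res_a, res_b))                   # 2/3 of res_a is new w.r.t. res_b

In practice, the disambiguation heuristic, the benchmark scorer, and the overlap measure would each depend on the particular resource and evaluation framework; the sketch only makes the indirect-evaluation idea concrete.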
<Paragraph position="5"> This paper is organized as follows: after this introduction, Section 2 describes the large-scale knowledge resources studied in this work. Section 3 describes the evaluation framework. Section 4 presents the evaluation results for the different semantic resources considered. Section 5 provides a qualitative assessment of this empirical study, and finally, conclusions and future work are presented in Section 6.</Paragraph>
</Section>
</Paper>