<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1030">
<Title>The Development of Lexical Resources for Information Extraction from Text Combining WordNet and Dewey Decimal Classification*</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">One of the current issues in Information Extraction (IE) is efficient transportability, as the cost of new applications is one of the factors limiting the market. The lexicon definition process is currently one of the main bottlenecks in producing applications. In fact, the lexicon necessary for an average application is generally large (hundreds to thousands of words), and most lexical information is not transportable across domains.</Paragraph>
<Paragraph position="1">The problem of lexicon transportability is worsened by the growing degree of lexicalization of IE systems: nowadays several successful systems adopt lexical rules at many levels.</Paragraph>
<Paragraph position="2">The IE research mainstream has focused essentially on the definition of lexica starting from a corpus sample (Riloff, 1993; Grishman, 1997), with the implicit assumption that the corpus provided for an application is representative of the whole application requirement. Unfortunately, one of the current trends in IE is the progressive reduction of the size of training corpora: e.g., from the 1,000 texts of MUC-5 (MUC-5, 1993) to the 100 texts of MUC-6 (MUC-6, 1995). When the corpus size is limited, the assumption of lexical representativeness of the sample corpus may no longer hold, and the problem arises of producing a representative lexicon starting from the corpus lexicon (Grishman, 1995).</Paragraph>
<Paragraph position="3">Generic resources are interesting because they contain (among other things) most of the terms necessary for an IE application. Nevertheless, up to now the use of generic resources within IE systems has been limited for two main reasons: first, the information associated with each term is often not detailed enough to describe the relations necessary for an IE lexicon; second, generic resources contain a large amount of lexical polysemy.</Paragraph>
<Paragraph position="4">In this paper we propose a methodology for semi-automatically developing the relevant part of a lexicon (the foreground lexicon) for IE applications by using both a small corpus and WordNet.</Paragraph>
<Paragraph position="5">* This work was carried out at ITC-IRST as part of the author's dissertation for the degree in Philosophy (University of Turin, supervisor: Carla Bazzanella). The author wishes to thank her supervisor at ITC-IRST, Fabio Ciravegna, for his constant help. Alberto Lavelli provided valuable comments on the paper.</Paragraph>
</Section>
</Paper>
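
Editor's illustrative sketch (not part of the paper, and not the authors' actual procedure): assuming NLTK's WordNet interface and a hypothetical list of terms drawn from a small application corpus, the snippet below shows how WordNet could propose candidate synonyms and hypernyms for a foreground lexicon, and why the number of senses per term (the polysemy noted above) makes human validation necessary in a semi-automatic setting.

    # Minimal sketch: candidate foreground-lexicon material from WordNet.
    # Assumes nltk is installed and the WordNet data has been downloaded
    # (nltk.download("wordnet")). The term list is hypothetical.
    from nltk.corpus import wordnet as wn

    # Hypothetical terms extracted from a small domain corpus.
    corpus_terms = ["company", "plant", "acquisition", "stock"]

    for term in corpus_terms:
        synsets = wn.synsets(term, pos=wn.NOUN)
        # The number of noun senses is a rough measure of the polysemy
        # problem that limits direct use of generic resources.
        print(f"{term}: {len(synsets)} noun senses in WordNet")
        for s in synsets:
            # Candidate lexicon material per sense: synonyms and direct
            # hypernyms; a lexicographer would keep only the senses that
            # are relevant to the application domain.
            synonyms = s.lemma_names()
            hypernyms = [h.name() for h in s.hypernyms()]
            print(f"  {s.name()}: syns={synonyms}, hypernyms={hypernyms}")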