<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1030">
  <Title>The Development of Lexical Resources for Information Extraction from Text Combining WordNet and Dewey Decimal Classification*</Title>
  <Section position="3" start_page="0" end_page="225" type="metho">
    <SectionTitle>
2 Developing IE Lexical Resources
</SectionTitle>
    <Paragraph position="0"> Lexical information in IE can be divided into three sources of information (Kilgarriff, 1997): * an ontology, i.e. the templates to be filled; * the foreground lexicon (FL), i.e. the terms tightly bound to the ontology; * the background lexicon (BL), i.e. the terms not related or loosely related to the ontology.</Paragraph>
    <Paragraph position="1"> In this paper we focus on FL only.</Paragraph>
    <Paragraph position="2"> The FL has generally a limited size with respect to the average dictionary of a language; its dimension depends on each application needs, but it is generally limited to some hundreds of words. The level of quantitative and qualitative information for each entry in the FL can be very high and it is not transportable across domains and  applications, as it contains the mapping between the entries and the ontology. Generic dictionaries can contribute in identifying entries for the FL, but generally do not provide useful information for the mapping with the ontology. This mapping between words and ontology is generally to be built by hand. Most of the time in transporting the lexicon is spent in identifying and building FLs. Efficiently building FLs for applications means building the right FL (or at least a reasonable approximation of it) in a short time. The right FL contains those words that are necessary for the application and only those. The presence of all the relevant terms should guarantee that the information in the text is never lost; inserting just the relevant terms allows to limit the development effort, and should guarantee the system from noise caused by spurious entries in the lexicon.</Paragraph>
    <Paragraph position="3"> The BL could be seen as the complementary set of the FL with respect to the generic language, i.e. it contains all the words of the language that do not belong to the FL. In general the quantity of application specific information is small. Any machine readable dictionary can be to some extent seen as a BL. The transport of BL to new applications is not a problem, therefore it will not be considered in this paper.</Paragraph>
    <Section position="1" start_page="225" end_page="225" type="sub_section">
      <SectionTitle>
2.1 Using Generic Lexical Resources
</SectionTitle>
      <Paragraph position="0"> We propose a development methodology for FLs based on two steps: * Bootstrapping: manual or semi-automatic identification from the corpus of an initial lexicon (Core Lexicon), i.e. of the lexicon covering the corpus sample.</Paragraph>
      <Paragraph position="1"> * Consolidation: extension of the Core Lexicon by using a generic dictionary in order to completely cover the lexicon needed by the application but not exhaustively represented in the corpus sample.</Paragraph>
      <Paragraph position="2"> We propose to use WordNet (Miller, 1990) as a generic dictionary during the consolidation phase because it can be profitably used for integrating  the Core Lexicon by adding for each term in a semi-automatic way: * its synonyms; * hyponyms and (maybe) hypernyms; * some coordinated terms.</Paragraph>
      <Paragraph position="3">  As mentioned, there are two problems related to the use of generic dictionaries with respect to the IE needs.</Paragraph>
      <Paragraph position="4"> First there is no clear way of extracting from them the mapping between the FL and the ontology; this is mainly due to a lack of information and cannot in general be solved; generic lexica cannot then be used during the bootstrapping phase to generate the Core Lexicon.</Paragraph>
      <Paragraph position="5"> Secondly experience showed that the lexical ambiguity carried by generic dictionaries does not allow their direct use in computational systems (Basili and Pazienza, 1997; Morgan et al., 1995). Even when they are used off-line, lexical ambiguity can introduce so much noise (and then overhead) in the lexical development process that their use can be inconvenient from the point of view of efficiency and effectiveness.</Paragraph>
      <Paragraph position="6"> The next section explains how it is possible to cope with lexical ambiguity in WordNet by combining its information with another source of information: the Dewey Decimal Classification (DDC) (Dewey, 1989).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="225" end_page="226" type="metho">
    <SectionTitle>
3 Reducing the lexical ambiguity in WordNet
</SectionTitle>
    <Paragraph position="0"> in WordNet The main problem with the use of WordNet is lexical polysemy 1. Lexical polysemy is present when a word is associated to many senses (synsets). In general it is not easy to discriminate between different synsets. It is then necessary to find a way for helping the lexicon developer in selecting the correct synset for a word.</Paragraph>
    <Paragraph position="1"> In order to cope with lexical polysemy, we propose to integrate WordNet synsets with an additional information: a set of field labels. Field labels are indicators, generally used in dictionaries, which provide information about the use of the word in a semantic field. Semantic fields are sets of words tied together by &amp;quot;similarity&amp;quot; covering the most part of the lexical area of a specific domain.</Paragraph>
    <Paragraph position="2"> Marking synsets with field labels has a clear advantage: in general, given a polysemous word in WordNet and a particular field label, in most of the cases the word is disambiguated. For example Security is polysemous as it belongs to 9 different synsets; only the second one is related to the economic domain. If we mark this synset with the field label Economy, it is possible to disambiguate the term Security when analyzing texts in an economic context. Note that WordNet being a hierarchy, marking a synset with a field label means also marking all its sub-hierarchy with such field label. In the Security example, if we mark the second synset with the field label Economy we also associate the same field label to the synonym Certificate, to the 13 direct hyponyms and to the 27 1 Actually the problem is related to both polysemy and omonymy. As WordNet does not distinguish between them, we will use the term polysemy for referring to both.</Paragraph>
    <Paragraph position="3">  indirect ones; moreover we can also inspect its co-ordinated terms and assign the same label to 9 of the 33 coordinate terms (and then to their direct and indirect hyponyms). Marking is equivalent to assigning WordNet synsets to sets each of them referring to a particular semantic field. Marking the structure allows us to solve the problem of choosing which synsets are relevant for the domain. Associating a domain (e.g., finance) to one or more field labels should allow us to determine in principle the synsets relevant for the domain.</Paragraph>
    <Paragraph position="4"> It is possible to greatly reduce the ambiguity implied by the use of WordNet by finding the correct set of field labels that cover all the WordNet hierarchy in an uniform way. Therefore we can reduce the overhead in building the FL using WordNet.</Paragraph>
    <Paragraph position="5"> Our assumption is that using semantic fields taken from the DDC 2 , all the possible domains can then be covered. This is because the first ten classes of the DDC (an extract is shown in figure 1) exhaust the traditional academic disciplines and so they also cover the generic knowledge of the world. The integration consists in marking parts of WordNet's hierarchy, i.e. some synsets, with semantic labels taken from the DDC.</Paragraph>
  </Section>
  <Section position="5" start_page="226" end_page="226" type="metho">
    <SectionTitle>
4 The development cycle using WN+DDC
</SectionTitle>
    <Paragraph position="0"> WN-PDDC The consolidation phase mentioned in section 2.1 can be integrated with the use of the WN+DDC  at the broadest level, it classifies concepts into ten main classes, which cover the entire world of knowledge. null as generic resource (see figure 2). Before starting the development, the set of field labels relevant for the application must be identified. Then the Core Lexicon is identified in the usual way.</Paragraph>
    <Paragraph position="1"> Using WN+DDC it is possible for each term in the Core Lexicon to: * identify the synsets the term belongs to; ambiguities are reduced by applying the intersection of the field labels chosen for the current application and those associated to the possible synsets.</Paragraph>
    <Paragraph position="2"> * integrate the Core Lexicon by adding, for each term: synonyms in the synsets, hyponyms and (maybe) hypernyms and some coordinated terms.</Paragraph>
    <Paragraph position="3"> The proposed methodology is corpus centered (starting from the corpus analysis to build the Core Lexicon) and can always be profitably applied. It also provides a criterion for building lexical resources for specific domains. It can be applied in a semiautomatic way. It has the advantage of using the information contained in Word-Net for expanding the FL beyond the corpus limitations, keeping under control the ambiguity implied by the use of a generic resource.</Paragraph>
  </Section>
class="xml-element"></Paper>