<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1012">
  <Title>The ACQUILEX LKB: representation issues in semi-automatic acquisition of large lexicons</Title>
  <Section position="3" start_page="0" end_page="89" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The ACQUILEX LKB is designed to support representation of multilingual lexical information extracted from machine readable dictionaries (MRDs) in such a way that it can be utilised by NLP systems. In contrast to lexical database systems (LDBs) or thesaurus-like representations (e.g. Alshawi et al., 1989; Calzolari, 1988) which represent extracted data in such a way as to support browsing and querying, our goal is to build a knowledge base which can be used as a highly structured reusable lexicon, albeit one much richer in lexical semantic information than those commonly used in NLP. Thus, although we are using information which has been derived from MRDs (possibly after considerable processing involving some human intervention), our aim is not to represent the dictionary entries themselves.</Paragraph>
    <Paragraph position="1"> Our methodology is to store the dictionary entries and raw extracted data in our LDB (Carroll, 1990) and to use this information to build LKB entries which could be directly utilised by an NLP system. Briscoe (1991) discusses the LDB/LKB distinction in more detail and describes the ACQUILEX project as a whole.</Paragraph>
    <Paragraph position="2"> [Footnote 1: 'The Acquisition of lexical knowledge for Natural Language Processing systems' (Esprit BRA-3030).] Practical NLP systems need large lexicons. Even in cases such as database front ends, where the domain of the application is highly restricted, a practical natural language interface must be able to cope with an extensive vocabulary, for example in order to respond helpfully to a user who lacks domain knowledge. For applications such as text-to-speech synthesis, interfaces to large-scale knowledge-based systems, summarising and so on, large lexicons are clearly needed; for machine translation the requirement is for a large scale, multilingual lexical resource. Acquisition of such information is a serious bottleneck in building NLP systems, and MRDs currently seem the most promising source for semi-automatically acquiring the syntactic and semantic information needed.</Paragraph>
    <Paragraph position="3"> Previous work on extracting and representing syntactic information includes the work done on the Alvey Tools lexicon project (Carroll and Grover 1989) in which a large scale lexicon was produced semi-automatically from LDOCE (Longman Dictionary of Contemporary English, Procter, 1978) using a feature and unification based representation. There has been considerable discussion and some implementation of LKBs for the representation of semantic information extracted from MRDs (e.g. Boguraev and Levin, 1990; Wilks et al., 1989). However the knowledge representation languages assumed are rarely described formally; typically a semantic network or a frame representation has been suggested, but the interpretation and functionality of the links has been left vague. Several networks based on taxonomies have been built, and these are useful for tasks such as sense-disambiguation, but are not directly utilisable as NLP lexicons. For a reusable lexicon, a declarative, formally specified representation language is essential.</Paragraph>
    <Paragraph position="4"> In the ACQUILEX project we are concerned with the extraction and representation of both syntactic and lexical semantic information. A common representation language is needed, to allow the interaction of lexical semantic and syntactic properties to be described. There is currently a considerable amount of work in lexical semantics where unification based formalisms are used to represent this interaction (e.g. Briscoe et al.'s (1990) account of logical metonymy (Pustejovsky, 1989, 1991), Sanfilippo's (1990) representation of thematic and aspectual information).</Paragraph>
    <Paragraph position="5"> However we also wish to structure the lexicon, in order to link lexical entries. This is essential since we are ultimately considering lexicons with maybe 100,000 entries for each language. Although the aim of the ACQUILEX project is to determine the feasibility of using MRD sources, rather than attempting to build a lexicon of such size, we nevertheless need an LKB which can cope with tens of thousands of entries.</Paragraph>
    <Paragraph position="6"> There are currently several approaches to developing representation languages which allow the lexicon to be structured, in particular by inheritance. These include object-oriented approaches (Daelemans, 1990), and DATR (Evans and Gazdar, 1990). We chose to use a graph unification based representation language for the LKB, because this offered the flexibility to represent both syntactic and semantic information in a way which could be easily integrated with much current work on unification grammar, parsing and generation. In contrast to DATR for example, the LKB's representation language (LRL) is not specific to lexical representation.</Paragraph>
    <Paragraph position="7"> This made it much easier to incorporate a parser in the LKB (for testing lexical entries) and to experiment with notions such as lexical rules and interlingual links between lexical entries. Although this means that the LRL is in a sense too general for its main application, the typing system provides a way of constraining the representations, and the implementation can then be made more efficient by taking advantage of such constraints.</Paragraph>
    <Paragraph position="8"> Our typed feature structure mechanism is based on Carpenter's work on the HPSG formalism (Carpenter 1990) although there are some significant differences.</Paragraph>
    <Paragraph position="9"> We augment the formalism with the more flexible psort inheritance mechanism, which allows for default inheritance. Much of the motivation behind this comes from consideration of the sense-disambiguated taxonomies semi-automatically derived from MRDs, which we are using to structure the LKB (see Copestake 1990). The notion of types, and features appropriate for a given type, gives some of the properties of frame representation languages, and allows us to provide a well-defined, declarative representation, which integrates relatively straightforwardly with much current work on natural language processing and lexical semantics.</Paragraph>
    <Paragraph position="10"> Thus the operations that the LKB supports are (default) inheritance, (default) unification, and lexical rule application. It does not support any more general forms of inference and is thus designed specifically to support processes which concern lexical rather than general reasoning. In the rest of this paper we first informally introduce the way in which lexical entries are represented in the LKB. We then describe the LRL, and discuss how the design of the default inheritance system was influenced by the application. (A fuller and more formal account of the LRL appears in papers in Briscoe et al., forthcoming.) We conclude with an overview of the actual implementation and a discussion of the utility of typed feature structures and the psort mechanism in practice.</Paragraph>
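The unification operation named above can be illustrated with a minimal sketch. It assumes feature structures are plain nested Python dicts with string leaves, and it leaves out types, reentrancy and defaults, so it is a simplification and not the LRL itself. The function name unify and the TELIC value "eat" are hypothetical; the feature names and the value "kip" come from the kippevlees example in the text.

```python
from typing import Optional, Union

# A feature structure is modelled as an atomic value (str) or a dict from
# feature names to feature structures. Assumption: no types, no reentrancy.
FS = Union[str, dict]


def unify(a: FS, b: FS) -> Optional[FS]:
    """Return the unification of a and b, or None if they conflict."""
    # The empty structure carries no information and unifies with anything.
    if a == {}:
        return b
    if b == {}:
        return a
    if isinstance(a, str) or isinstance(b, str):
        # Atomic values unify only with an identical atomic value.
        return a if a == b else None
    result = dict(a)
    for feature, value in b.items():
        if feature in result:
            merged = unify(result[feature], value)
            if merged is None:
                return None  # a clash anywhere makes the whole unification fail
            result[feature] = merged
        else:
            result[feature] = value
    return result


# Compatible information is combined; "eat" is a hypothetical TELIC value.
entry = {"ORTH": "kippevlees", "RQS": {"ORIGIN": "kip"}}
combined = unify(entry, {"RQS": {"TELIC": "eat"}})
print(combined)
```

Unification is monotonic: it only ever adds information, and it fails on any conflict. Default unification differs in that conflicting default information is discarded instead of causing failure.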
    <Paragraph position="11"> 2 Lexical entries in the LKB
Consider Figure 1, which is a screen dump of the LKB system showing part of a file containing a semi-automatically generated lexical entry for the Dutch noun kippevlees (chicken meat) (top right of figure), the feature structure (FS) representation of that description (bottom right) and the fully typed feature structure into which it is expanded in the LKB (left of figure).</Paragraph>
    <Paragraph position="12"> See Vossen (1991) for the details of the generation of this entry. Features are shown uppercased, types are in lowercase bold, and reentrancy is indicated by numbers in angle brackets. The identifier for the lexical entry, kippevlees_V_0_1, indicates that it corresponds to the sense kippevlees 1 in the Van Dale dictionary. The unexpanded lexical entry is relatively compact, but a large amount of information is inherited via the type and psort systems. The expanded lexical entry is not shown completely; the entry's syntactic type is noun-cat, and the box round this indicates that its internal structure is not displayed. The same applies to the sense-id information (which enables the corresponding LDB entry to be accessed) and the argument structure. Figure 2 (left) shows the type lex-uncount-noun, which determines the basic skeleton of the entry. Feature structures (called constraints) are associated with types and inherited by all FSs of a particular type. Thus the form of this lexical entry is due to the constraint on lex-uncount-noun shown in the figure.</Paragraph>
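As a rough illustration of how constraints on types could determine the skeleton of an entry, the sketch below expands a compact entry by collecting constraints from its type and that type's ancestors. The hierarchy fragment, the choice of which type introduces which feature, and the names TYPES, ancestors and expand are all assumptions made for illustration. Only the type names lex-uncount-noun, noun-cat, sense-id and nomrqs and the feature names appear in the paper, and single inheritance is assumed for brevity.

```python
# Hypothetical fragment of a type hierarchy. Each type names its parent and
# the features its constraint introduces, mapped to the type of their value.
TYPES = {
    "top": {"parent": None, "constraint": {}},
    "lex-sign": {
        "parent": "top",
        "constraint": {"ORTH": "string", "SENSE-ID": "sense-id"},
    },
    "lex-noun-sign": {
        "parent": "lex-sign",
        "constraint": {"CAT": "noun-cat", "RQS": "nomrqs"},
    },
    # Adds nothing of its own in this toy fragment.
    "lex-uncount-noun": {"parent": "lex-noun-sign", "constraint": {}},
}


def ancestors(type_name):
    """Yield type_name and then each of its ancestors, most specific first."""
    current = type_name
    while current is not None:
        yield current
        current = TYPES[current]["parent"]


def expand(type_name, entry):
    """Build the skeleton for type_name, then add the entry's own values."""
    skeleton = {}
    # Apply constraints from the most general type down to the most specific.
    for t in reversed(list(ancestors(type_name))):
        skeleton.update(TYPES[t]["constraint"])
    # A feature is appropriate only if some type on the path introduces it.
    unknown = set(entry) - set(skeleton)
    if unknown:
        raise ValueError(
            f"features not appropriate for {type_name}: {sorted(unknown)}"
        )
    skeleton.update(entry)
    return skeleton


expanded = expand("lex-uncount-noun", {"ORTH": "kippevlees"})
print(expanded)
```

The compact entry supplies only ORTH, and the remaining features come from the type system, which mirrors the observation that the unexpanded entry is small while the expanded one is large.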
    <Paragraph position="13"> Default inheritance from the lexical semantic structure for the lexical entry for vlees_V_0_1 augments the type information for the entry for kippevlees. We encode a relatively rich lexical semantic structure for nouns (referred to as the 'relativised qualia structure', RQS) based on the notion of qualia structure, described by Pustejovsky (1989, 1991). Noun lexical entries are parsed to yield a genus term, vlees in this case, and differentia.</Paragraph>
    <Paragraph position="14"> The genus term is normally interpreted in LKB terms as specifying the lexical entry from which information is inherited by default; as explained in Section 4 this also partially defines the lexical semantic type (RQS type), which is c_nat_subst in this example (for comestible, natural, substance). A fragment of the RQS type hierarchy is also shown in Figure 2 (see footnote 2). The differentia can be partially interpreted relative to the RQS type; in this example &lt; rqs : origin &gt; = "kip" is an indication that kippevlees comes from kip; eventually this will allow their lexical entries to be linked automatically by the appropriate lexical rule (Copestake and Briscoe, 1991; Copestake et al., 1992). The feature ORIGIN is introduced at type natural (the FS definition of natural is shown in Figure 2, top right, before expansion by inheritance from nomrqs). Since natural is a parent of c_nat_subst, ORIGIN is an appropriate feature for c_nat_subst.</Paragraph>
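The default inheritance from the genus term described above can be sketched as follows, under the assumption that feature structures are nested dicts with string leaves and that the entry's own information always takes priority over inherited defaults. This is a simplification of default unification and not the LRL definition. The function name default_inherit and the values given for vlees are hypothetical; only ORIGIN = "kip" and the feature names ORIGIN, TELIC and PHYSICAL-STATE come from the text.

```python
import copy


def default_inherit(own, inherited):
    """Add information from inherited to own wherever it does not conflict.

    Values already present in own always win, so a conflicting default is
    dropped instead of making the operation fail.
    """
    result = dict(own)
    for feature, default in inherited.items():
        if feature not in result:
            # Nothing specified locally: take a copy of the default.
            result[feature] = copy.deepcopy(default)
        elif isinstance(result[feature], dict) and isinstance(default, dict):
            # Both sides are complex: inherit recursively, feature by feature.
            result[feature] = default_inherit(result[feature], default)
        # Otherwise the entry's own value overrides the default.
    return result


# Hypothetical RQS values for the genus term vlees.
vlees_rqs = {"ORIGIN": "animal", "TELIC": "eat", "PHYSICAL-STATE": "solid"}
# The differentia of kippevlees specify only its origin.
kippevlees_rqs = {"ORIGIN": "kip"}

inherited_rqs = default_inherit(kippevlees_rqs, vlees_rqs)
print(inherited_rqs)
```

Here the ORIGIN given by the differentia survives, while TELIC and PHYSICAL-STATE are filled in from the genus entry. Giving local information priority is one simple design choice, and the paper's own account of how defaults interact with types is given in its later sections.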
    <Paragraph position="15"> The feature TELIC is used to provide a slot for the semantics of the verb sense which is associated with the purpose of an entity (eating in this case). The way in which such a representation may be used in the treatment of logical metonymy was described in Briscoe et al. (1990). Other features (such as PHYSICAL-STATE) are used to encode information which is useful for applications such as sense-disambiguation. [Footnote 2: Unlike Carpenter (1990) we adopt a notation with the most general type at the top of any diagram, because this seems more natural to the main users of the system.]</Paragraph>
    <Paragraph position="16"> [Figure 1: screen dump of the LKB system showing the lexical entry for kippevlees, the feature structure representation of that description, and the expanded typed feature structure.] This attempt to represent detailed lexical semantic information illustrates a general principle of the ACQUILEX project: such lexical entries are usable by a wide range of NLP systems because they are relatively rich and detailed, and applications which do not make use of detailed lexical semantic information can simply discard it. Clearly the converse is not true, and a more impoverished representation would be less generally useful. We thus aim for representations which are as rich as possible in information which we can extract automatically, and represent formally, but which are also well motivated linguistically and/or useful for practical NLP applications. This also applies to our use of thematic roles in the semantics; see the examples of LKB entries for verbs given in Sanfilippo and Poznanski (1992, this volume).</Paragraph>
  </Section>
</Paper>