File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/a92-1012_metho.xml
Size: 24,061 bytes
Last Modified: 2025-10-06 14:12:54
<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1012"> <Title>The ACQUILEX LKB: representation issues in semi-automatic acquisition of large lexicons</Title> <Section position="4" start_page="89" end_page="90" type="metho"> <SectionTitle> 3 The type system </SectionTitle> <Paragraph position="0"> In the definition of a type hierarchy we follow Carpenter (1990) very closely. The type hierarchy defines a partial order (notated ⊑, "is more specific than") on the types and specifies which types are consistent. Only FSs with mutually consistent types can be unified -- two types which are unordered in the hierarchy are assumed to be inconsistent unless the user explicitly specifies a common subtype. Every consistent set of types S ⊆ TYPE must have a unique greatest lower bound or meet (notation ⊓S).³ This condition allows FSs to be typed deterministically -- if two FSs of types a and b are unified the type of the result will be a ⊓ b, which must be unique if it exists. If a ⊓ b does not exist, unification fails. In the fragment of a type hierarchy shown in Figure 2, c_natural and natural_substance are consistent; c_natural ⊓ natural_substance = c_nat_subst. Because the type hierarchy is a partial order it has the properties of reflexivity, transitivity and anti-symmetry (from which it follows that the type hierarchy cannot contain cycles).</Paragraph> <Paragraph position="1"> We define a typed feature structure as a tuple F = ⟨Q, q0, δ, θ⟩, where the only difference from the untyped case is that every node of a typed FS has a type, θ(q). The type of a FS is the type of its initial node, θ(q0). The definition of subsumption of typed FSs is very similar to that for untyped FSs, with the additional proviso that the ordering must be consistent with the ordering of their types. We thus overload the symbol ⊑ ("is-more-specific-than", "is-subsumed-by") to express subsumption of FSs as well as the ordering on the type hierarchy. Thus if F1 and F2 are FSs of types t1 and t2 respectively, then F1 ⊑ F2 only if t1 ⊑ t2.</Paragraph> <Paragraph position="2"> ³ In order to check the type hierarchy for uniqueness of greatest lower bounds we carry out a pairwise comparison of types with multiple parents to see if they have a unique least upper bound. Since the number of types with multiple parents is typically much less than the total number of types, this is considerably more efficient than carrying out pairwise comparisons on all the types in the hierarchy.</Paragraph> <Section position="1" start_page="90" end_page="90" type="sub_section"> <SectionTitle> 3.1 Constraints </SectionTitle> <Paragraph position="0"> Our system differs somewhat from that described by Carpenter in that we adopt a different notion of well-formedness of typed feature structures. In our system every type must have exactly one associated FS which acts as a constraint on all FSs of that type, by subsuming all well-formed FSs of that type. The constraint also defines which features are appropriate for a particular type; a well-formed FS may only contain appropriate features.</Paragraph> <Paragraph position="1"> Constraints are inherited by all subtypes of a type, but a subtype may introduce new features (which will be inherited as appropriate features by all its subtypes). A constraint on a type is a well-formed FS of that type; all constraints must therefore be mutually consistent.</Paragraph>
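To make the greatest-lower-bound condition concrete, the following is a minimal sketch of meet computation over the Figure 2 fragment. It is hypothetical Python written purely for illustration (the LKB itself is implemented in Common Lisp, as described in Section 5.1); the type names come from Figure 2, but the encoding of the hierarchy as a table of immediate parents is an assumption of the sketch, not a description of the LKB's internal representation.

# Hypothetical sketch of meet (greatest lower bound) computation over the
# Figure 2 fragment; not the LKB implementation.

PARENTS = {                                   # immediate supertypes; "top" is the root
    "c_natural": ["top"],
    "natural_substance": ["top"],
    "c_nat_subst": ["c_natural", "natural_substance"],  # user-declared common subtype
}
TYPES = set(PARENTS) | {"top"}

def ancestors(t):
    """All types that t is more specific than, including t itself."""
    seen, stack = {t}, [t]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def meet(a, b):
    """Return the meet of a and b, or None if they are inconsistent.
    Assumes the hierarchy has already been checked for glb uniqueness."""
    lower_bounds = [t for t in TYPES if {a, b} <= ancestors(t)]
    for t in lower_bounds:                    # the glb subsumes every other lower bound
        if all(t in ancestors(s) for s in lower_bounds):
            return t
    return None

assert meet("c_natural", "natural_substance") == "c_nat_subst"
assert meet("c_natural", "top") == "c_natural"

When two typed FSs are unified, the result is typed with exactly this meet; if no meet exists, the unification fails.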
<Paragraph position="2"> Features may only be introduced at one point in the type hierarchy (cf. Carpenter's minimal introduction).</Paragraph> <Paragraph position="3"> Because of the condition that any consistent set of types must have a unique greatest lower bound, it is also the case that sets of features will become valid at unique greatest points in the type hierarchy. This allows untyped feature structures to be introduced into the system by the user, which are then given the most general possible type. The importance of this form of type inference for our application is discussed in Section 5.2, below.</Paragraph> <Paragraph position="4"> Constraints are given by the function C : (TYPE, ⊑) → ℱ, where ℱ is the set of FSs. C(t) denotes the constraint FS associated with type t. We define the notion of appropriate features as follows: Definition 1: If C(t) = ⟨Q, q0, δ, θ⟩ we define Appfeat(t) = Feat(q0), where Feat(q) is the set of features labelling transitions from the node q, such that f ∈ Feat(q) if δ(f, q) is defined.</Paragraph> <Paragraph position="5"> The conditions on the constraint function are as follows. First, a constraint may not be recursive: we therefore disallow any occurrence of t in a substructure of C(t), thus if C(t) = ⟨Q, q0, δ, θ⟩ then for all q ∈ Q, q ≠ q0 implies that θ(q) ≠ t. Since we disallow cycles in FSs such a constraint could only be satisfied by an infinite FS, which is also disallowed.</Paragraph> <Paragraph position="6"> Maximal introduction of features: for every feature f ∈ FEAT there is a unique type t = Maxtype(f) such that f ∈ Appfeat(t) and there is no type s such that t ⊏ s and f ∈ Appfeat(s). The maximal appropriate value of a feature, Maxappval(f), is the type t such that if C(Maxtype(f)) = ⟨Q, q0, δ, θ⟩ then t = θ(δ(f, q0)). Definition 2: We say that a given FS F = ⟨Q, q0, δ, θ⟩ is a well-formed FS iff for all q ∈ Q, we have that F' = ⟨Q', q, δ, θ⟩ ⊑ C(θ(q)) and Feat(q) = Appfeat(θ(q)). Carpenter separates the notions of typing and constraints. This allows a more powerful constraint language, but complicates the system. Since the users of the LKB were initially not familiar with feature structure representations it was important to keep the system as simple as possible, and in practice we have not yet found the additional power of Carpenter's constraint language necessary.</Paragraph> <Paragraph position="7"> Some relatively minor extensions to the formalism allow the implementation of some cooccurrence restrictions and the disjunction of atomic types. It is necessary to allow types with string values, representing orthography for example, to be introduced as needed rather than predefined; we therefore define an atomic type string which is allowed to have any string as a subtype without these being explicitly specified. All subtypes of string are taken to be disjoint.</Paragraph> </Section> </Section> <Section position="5" start_page="90" end_page="92" type="metho"> <SectionTitle> 4 Default inheritance and taxonomies </SectionTitle> <Paragraph position="0"> We extend the typed FS system with default inheritance.</Paragraph> <Paragraph position="1"> FSs may be specified as inheriting by default from one or more other (well-formed) FSs, which we refer to in this context as psorts. Psorts may correspond to (parts of) lexical entries or be specially defined.
Since psorts may themselves inherit information, default inheritance (notated by <, "inherits from") in effect operates over a hierarchy of psorts. We prohibit cycles in the inheritance ordering. Inheritance order must correspond to the type hierarchy order:</Paragraph> <Paragraph position="2"> p1 < p2 ⇒ Typeof(p1) ⊑ Typeof(p2), where p1 and p2 are psorts. The typing system thus restricts default inheritance essentially to the filling in of values for features which are defined by the type system.</Paragraph> <Paragraph position="3"> Default inheritance is implemented by a version of default unification, for a detailed discussion of which see Carpenter (1991, forthcoming). In default unification, unlike ordinary unification, inconsistent information is ignored rather than causing failure; however the definition is complicated by the need to consider the interactions between reentrant FSs. The way we deal with this is discussed in detail in Copestake (1991, forthcoming), but since the problematic cases seem to arise relatively rarely in our particular application, we will not discuss the full definition here. We use ⊓< to signify default unification, where A ⊓< B means that A is the non-default and B the default FS. When no reentrancy interactions are involved the definition is: A ⊓< B = A ⊓ (⊓{C ∈ Ψ | A ⊓ C ≠ ⊥}), where Ψ is the set of all component FSs of B.</Paragraph> <Paragraph position="4"> The ordering on the psort hierarchy gives us an ordering on defaults. So for example, assume that the following is the lexical entry for BOOK_L_1_1: [FS displays not recovered] ... PHYSICAL-STATE ... but the value of TELIC overrides that inherited from BOOK_L_1_1. LEXICON inherits its value for the telic role from DICTIONARY rather than from BOOK_L_1_1: [FS display not recovered]</Paragraph> <Paragraph position="6"> Multiple default inheritance is allowed but is restricted to the case where the information from the parent psorts does not conflict. This is enforced by unifying all (fully expanded) immediate parent psorts before default unifying the result with the daughter psort. The type restriction on default inheritance means that all the psorts must have compatible types and the type of the daughter must be the meet of those types. We define inheritance to operate top-down; that is, a psort will be fully expanded with inherited information before it is used for default inheritance. We adopted this approach as we are primarily interested in default inheritance between fully formed lexical entries; since we disallow conflicts arising from multiple inheritance, distinctions between top-down and bottom-up inheritance only arise with the problematic cases of default unification alluded to above. We also allow non-default inheritance from psorts, implemented by ordinary unification. This is a relatively recent addition to the LKB, prompted partly by issues in the representation of the multilingual translation links. It also seemed to be desirable in the representation of qualia structure, in order to allow the telic role of a noun to be specified directly in terms of a verb sense, without allowing other information in that lexical entry to conflict. Thus the entry for dictionary above would actually specify: < rqs : telic > == refer_to_L_0_2 < sem >, where == indicates non-default inheritance.</Paragraph> <Paragraph position="7"> Although introducing psorts as well as types may seem unnecessarily complex, there seem to be compelling reasons for doing so for this application, where we wish to use taxonomic information extracted from MRDs to structure the lexicon.</Paragraph>
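The simplified, reentrancy-free definition of ⊓< given above can be illustrated with a short sketch. This is hypothetical Python in which plain nested dictionaries stand in for untyped, tree-shaped FSs; it is not the LKB implementation, it ignores types and reentrancy, and the feature values in the final example are invented rather than taken from the actual BOOK_L_1_1 entry.

# Hypothetical sketch of default unification (A below as a, B as b) for the
# reentrancy-free case, with nested dicts as tree-shaped feature structures.

FAIL = None                      # stands for the inconsistent FS (bottom)

def unify(a, b):
    """Ordinary unification; atomic values are strings, substructures are dicts."""
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feat, val in b.items():
            if feat in result:
                sub = unify(result[feat], val)
                if sub is FAIL:
                    return FAIL
                result[feat] = sub
            else:
                result[feat] = val
        return result
    return FAIL                  # atom/atom or atom/structure clash

def components(fs, path=()):
    """One path=value component per atomic value in fs."""
    if isinstance(fs, dict):
        for feat, val in fs.items():
            yield from components(val, path + (feat,))
    else:
        yield path, fs

def path_fs(path, value):
    """Build the FS containing just the given path with the given atomic value."""
    for feat in reversed(path):
        value = {feat: value}
    return value

def default_unify(a, b):
    """Unify the non-default FS a with every component of the default FS b
    that is individually consistent with a."""
    result = a
    for path, value in components(b):
        piece = path_fs(path, value)
        if unify(a, piece) is not FAIL:      # keep only non-conflicting defaults
            result = unify(result, piece)
    return result

# Invented illustration: the default TELIC value is overridden, the rest is inherited.
non_default = {"TELIC": "refer_to"}
default = {"TELIC": "read", "PHYSICAL-STATE": "solid"}
assert default_unify(non_default, default) == {"TELIC": "refer_to",
                                               "PHYSICAL-STATE": "solid"}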
<Paragraph position="8"> The type hierarchy is not a suitable way for representing taxonomic inheritance for several reasons. Perhaps the most important is that taxonomically inherited information is defeasible, but typing and defaults are incompatible notions. Types are needed to enforce an organisation on the lexicon -- if this can be overridden it is useless. Furthermore the type system is taken to be complete, and various conditions are imposed on it, such as the greatest lower bound condition, which ensure that deterministic classification is possible. Taxonomies extracted from dictionaries will not be complete in this sense, and will not meet these conditions. Intuitively we would expect to be able to classify lexical entries into categories such as human, artifact and so on, and to be able to state that all creatures are either humans or animals, since in effect this is how we are defining those types. But we would not expect to be able to use the finer-grained, automatically acquired information in this way; we will never extract all possible categories of horse, for example.</Paragraph> <Paragraph position="9"> In implementational terms, using the type hierarchy to provide the fine grain of inheritance possible with taxonomic information would be very difficult.</Paragraph> <Paragraph position="10"> A type scheme should be relatively static; any alterations may affect a large amount of data, and checking that the scheme as a whole is still consistent is a non-trivial process. Because the inheritance hierarchies are derived from taxonomies, and thus semi-automatically from MRDs, they will contain errors and it is important that these can be corrected easily. In practice, deciding whether to make use of the type mechanism or the psort mechanism has been relatively straightforward. If we wish to use a feature which is particular to some group of lexical entries we have to introduce a type; otherwise, especially if the information might be defeasible, we use a psort.</Paragraph> <Paragraph position="11"> Several of the decisions involved in designing the default inheritance system were thus influenced by the application. The condition that the default inheritance ordering reflects the type ordering was partly motivated by the desire to be able to provide an RQS type for lexical entries on the basis of taxonomic data alone. However it also seems intuitively reasonable as a way of restricting default inheritance; without some such restriction it is difficult to make any substantive claims when default inheritance is used to model some linguistic phenomenon.</Paragraph> </Section> </Section> <Section position="6" start_page="92" end_page="94" type="metho"> <SectionTitle> 5 Using the LKB </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="92" end_page="93" type="sub_section"> <SectionTitle> 5.1 Interface and implementation </SectionTitle> <Paragraph position="0"> The LKB as described here is fully implemented in Procyon Common Lisp running on Apple Macintoshes. It is in use by all the academic groups involved in the ACQUILEX project. In total there are currently about 20 users at five sites in different countries. Interaction with the LKB is entirely menu-driven. Besides the obvious functions to load and view types, lexical entries, psorts, lexical rules and so on, there are various other facilities which are necessary for the application. A very simple (and inefficient) parser is included, to aid development of types and lexical entries.
There are tools for supporting multilingual linked lexicons, described in Copestake et al. (1992). The LKB is integrated with our LDB system so that information extracted from dictionary entries stored in the LDB can be used to build LKB lexicons.</Paragraph> <Paragraph position="1"> The type system which has been developed for use on the ACQUILEX project is fairly large (about 450 types and 80 features). Currently nearly 15,000 lexical entries containing syntactic and semantic information have been stored in the LKB. The bulk of these entries are currently made up of nouns for which the main semantic information is inherited down semi-automatically derived taxonomies. Sanfilippo and Poznanski (1992) describe the semi-automatic derivation of entries for English psychological predicates by augmenting LDOCE with thesaurus information derived from the Longman Lexicon. Work has begun on deriving multilingual linked lexicons.</Paragraph> <Paragraph position="2"> Given the complexity of the FSs for lexical entries, and the size of the lexicons to be supported by the LKB, it is clearly not possible to store lexicons in main memory.</Paragraph> <Paragraph position="3"> Lexical entries are thus stored on disk, to be expanded as required. Entries may be indexed by the type of FS at the end of user-defined paths, and also by the psort(s) from which they are defined to inherit, although producing such indices for large lexicons is time-consuming. Checking lexical entries (for well-formedness, default inheritance conflicts and presence of cycles) can be carried out at the same time as indexing or acquisition.</Paragraph> <Paragraph position="4"> Efficiency gains arising directly from the use of types were not a major factor in our decision to use a typed system. Although parsing with typed FSs is more efficient than with untyped ones, since unification will fail when a type conflict occurs, this is not particularly important in the LKB, since most unifications will be performed while expanding lexical entries, when the vast majority of unifications would be expected to succeed. Since there is some overhead in typing the FSs, the use of types probably decreases efficiency slightly, although the unifications involved will be comparable to those needed if the same information were conveyed by templates. Since the LKB has to cope with large lexicons, with thousands of complex lexical entries, space efficiency rather than speed is the major consideration. The most important factor in space efficiency is the use of inheritance, both in the type system and the psort system, which allows unexpanded lexical entries to be very compact.</Paragraph> </Section> <Section position="2" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 5.2 Typing and automatic acquisition of large lexicons </SectionTitle> <Paragraph position="0"> Our notion of typing of FSs can be regarded as a way of getting the functionality of templates in untyped FS formalisms, with the added advantages of type checking and type inference. As a method of lexical organisation, types have significant advantages over templates, especially for a large-scale collaborative project. Once an agreed type system is adopted, the compatibility of the data collected by each site is guaranteed.
There may of course be problems of differing interpretation of types and features, but this applies to any representation; to ameliorate them we associate short documentation with each type, accessible via the menu interface from any point where the type is displayed. In an untyped feature system, typographical errors and so on may go undetected, and debugging a large template system can be extremely difficult; a type system makes error detection much simpler. Since a given FS has a type permanently associated with it, it is also much more obvious how information has come to be inherited than if templates are used.</Paragraph> <Paragraph position="1"> Essentially the same advantages of safety and clarity apply to strict typing of FSs as to strict typing in programming languages. Of course a reduction in flexibility of representation has to be accepted, once a particular type system is adopted. It is possible to achieve a very considerable degree of modularisation; we have found that we could develop the noun RQS type system almost completely independently of the verb type system, once a small number of common types were agreed on, and that name clashes were the only problem found when reintegrating the two. After approximately eight months of use we are now on the third version of both the verb and the noun type systems; individual users have been experimenting with various representations which are then integrated into the general system as appropriate. Encoding the agreed representation in terms of a type system, rather than by means of templates, makes global alterations relatively easy because of the localisation of the information (for example, since a feature can only be introduced at one point in the hierarchy, it is easy to find all types which will be affected by a change in feature name) and the error checking. It is important that reprocessing of raw dictionary data is avoided when a type system is changed, particularly if user interaction is involved, but storing intermediate results in the LDB as a derived dictionary helps achieve this. Even within the project it has proved useful to have local type systems and lexicons, and to derive entries for these automatically from the general LKB. Currently this is achieved by ad hoc methods; we intend to investigate the development of tools to make transfer of information easier and more declarative.</Paragraph> <Paragraph position="2"> Ageno et al. (1992) describe one way in which the type system can be integrated with tools for semi-automatic analysis of dictionary definitions. Types are correlated with the templates used in a robust pattern-matching parser, and user interaction can be controlled by the type system. The user is only allowed to introduce information appropriate for a particular type, and a menu-based interface can both inform the user of the possible values and preclude errors.</Paragraph> <Paragraph position="3"> The utility of typing for error checking when representing automatically acquired data can be seen in the following simple example. The machine-readable version of LDOCE associates semantic codes with senses.
Examples of such codes are P for plant, H for human, M for male human, K for male human or animal, and so on.</Paragraph> <Paragraph position="4"> When automatically acquiring information about nouns from LDOCE, we specify a value for the feature SEX, where this is possible according to the semantic codes.</Paragraph> <Paragraph position="5"> Thus the automatically created lexical entry for bull 1 1 contains the line: < rqs : sex > = male. In the current type system the feature SEX is introduced at type creature. A few LDOCE entries have incorrect semantic codes; Irish stew for example has code K. Since Irish stew has RQS type c_artifact, which is not consistent with creature, SEX was detected as an inappropriate feature. Attempts at expansion of the automatically generated lexical entry caused an error message to be output, and the user had the opportunity to correct the mistake. If the LKB were not a typed system, errors such as this would not be detected automatically in this way.</Paragraph> <Paragraph position="6"> In contrast, automatic classification of lexical entries by type, according to feature information, can be used to force specification of appropriate information. A lexical entry which has not been located in a taxonomy will be given the most general possible type for its RQS. However if a value for the feature SEX has been specified this forces an RQS type of creature. This would also force the value of ANIMATE to be true, for example.</Paragraph> </Section> <Section position="3" start_page="93" end_page="94" type="sub_section"> <SectionTitle> 5.3 The psort inheritance mechanism </SectionTitle> <Paragraph position="0"> Manual association of information with psorts has proved to be a highly efficient method of acquiring information, since many psorts have hundreds of daughter entries. Creating 'artificial' psorts, which can be used where there is no simple lexicalisation of a concept, is also a powerful technique. Disjunctions such as person or animal, for example, can be represented as the generalisation of the two psorts involved. This and other cases of more complex taxonomic inheritance are discussed by Vossen and Copestake (1991, forthcoming).</Paragraph> <Paragraph position="1"> We adopted the most conservative approach to multiple default inheritance (i.e. information inherited from multiple parents has to be consistent) because we knew we would have to cope with errors in extraction of information from MRDs, and with the lexicographers' original mistakes. We expected this to be overrestrictive, but in fact our consistency condition seems to be met fairly naturally by the data. Taxonomies extracted from MRDs are in general tree-structured (once sense disambiguation has been performed); there do not tend to be many examples of genuine conjunction, for example. Multiple inheritance is mainly needed for cross-classification; artifacts for example may be defined principally in terms of their form or in terms of their function, but here different sets of features are typically specified, so the information is consistent.</Paragraph>
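The appropriateness check and type inference described in Section 5.2 (and appealed to again in the next paragraph) can be sketched as follows. This is hypothetical Python over an invented two-type stand-in for the real RQS hierarchy; the only detail taken from the text above is that SEX is introduced at the type creature.

# Hypothetical sketch of feature appropriateness checking and type inference
# over a toy stand-in for the RQS hierarchy; not the LKB implementation.

PARENTS = {"creature": ["rqs_top"], "c_artifact": ["rqs_top"]}
TYPES = set(PARENTS) | {"rqs_top"}
MAXTYPE = {"SEX": "creature"}     # the unique type at which each feature is introduced

def ancestors(t):
    seen, stack = {t}, [t]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def appropriate(feature, rqs_type):
    """A feature is appropriate for a type iff its introduction point subsumes it."""
    return MAXTYPE[feature] in ancestors(rqs_type)

def infer_type(features):
    """The most general type at which all the given features are appropriate."""
    candidates = [t for t in TYPES if all(appropriate(f, t) for f in features)]
    for t in candidates:
        if all(t in ancestors(s) for s in candidates):
            return t
    return None

# The erroneous code K on "Irish stew" supplies SEX on an entry of RQS type
# c_artifact; the check fails, so the entry is flagged rather than accepted.
assert not appropriate("SEX", "c_artifact")
# Conversely, an entry with a value for SEX is forced to have RQS type creature.
assert infer_type(["SEX"]) == "creature"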
<Paragraph position="2"> Furthermore it frequently turns out to be difficult to identify a second psort parent from the dictionary definition differentia.</Paragraph> <Paragraph position="3"> However, type inference resulting from feature instantiation may still force a type to be assigned which represents the cross-classification.</Paragraph> </Section> </Section> </Paper>