<?xml version="1.0" standalone="yes"?> <Paper uid="E85-1021"> <Title>DESIGN AND IMPLEMENTATION OF A LEXICAL DATA BASE</Title> <Section position="4" start_page="0" end_page="151" type="metho"> <SectionTitle> OVERVIEW OF THE PROBLEM </SectionTitle> <Paragraph position="0"> One of the well-known characteristic features of natural languages is the size and the complexity of their lexicons. This is in sharp contrast with artificial languages, which typically have small lexicons, in most cases made up of simple, unambiguous lexical items. Not only do natural languages have a huge number of lexical elements -- no matter what precise definition of this latter term one chooses -- but these lexical elements can furthermore (i) be ambiguous in several ways, (ii) have a non-trivial internal structure, or (iii) be part of compounds or idiomatic expressions, as illustrated in (1)-(4): (1) ambiguous words: can, fly, bank, pen, race, etc.</Paragraph> <Paragraph position="1"> (2) internal structure: use-ful-ness, mis-understand-ing, lake-s, tri-ed (3) compounds: milkman, moonlight, etc.</Paragraph> <Paragraph position="2"> (4) idiomatic expressions: to kick the bucket, by and large, to pull someone's leg, etc.</Paragraph> <Paragraph position="3"> In fact, the notion of word itself is not all that clear, as numerous linguists -- theoreticians and/or computational linguists -- have acknowledged. Thus, to take an example from the computational linguistics literature, Kay (1977) notes: &quot;In common usage, the term word refers sometimes to sequences of letters that can be bounded by spaces or punctuation marks in a text. According to this view, run, runs, running and ran are different words. 
But common usage also allows these to count as instances of the same word because they belong to the same paradigm in English accidence and are listed in the same entry in the dictionary.&quot; Some of these problems, as well as the general question of what constitutes a lexical entry, whether or not lexical items should be related to one another, etc., have been much debated over the last 10 or 15 years within the framework of generative grammar. Considered as a relatively minor appendix of the phrase-structure rule component in the early days of generative grammar, the lexicon became little by little an autonomous component of the grammar with its own specific formalism -- lexical entries as matrices of features, as advocated by Chomsky (1965). Finally, it also acquired specific types of rules, the so-called word formation rules (cf. Halle, 1973; Aronoff, 1976; Lieber, 1980; Selkirk, 1983, and others), and lexical redundancy rules (cf. Jackendoff, 1975; Bresnan, 1977).</Paragraph> <Paragraph position="4"> By and large, there seems to be widespread agreement among linguists that the lexicon should be viewed as the repository of all the idiosyncratic properties of the lexical items of a language (phonological, morphological, syntactic, semantic, etc.). This agreement quickly disappears, however, when it comes to defining what constitutes a lexical item, or, to put it slightly differently, what the lexicon is a list of, and how it should be organized.</Paragraph> <Paragraph position="5"> Among the many proposals discussed in the linguistic literature, I will consider two radically opposed views that I shall call the morpheme-based and the word-based conceptions of the lexicon.</Paragraph> <Paragraph position="6"> The morpheme-based lexicon corresponds to the traditional derivational view of the lexicon, shared by the structuralist school, many of the generative linguists and virtually all the computational linguists. 
According to this option, only non-derived morphemes are actually listed in the lexicon, complex words being derived by means of morphological rules. In contrast, in a word-based lexicon a la Jackendoff, all the words (simple and complex) are listed as independent lexical entries, derivational as well as inflectional relations being expressed by means of redundancy rules.</Paragraph> <Paragraph position="7"> The crucial distinction between these two views of the lexicon has to do with the role of morphology. The morpheme-based conception of the lexicon advocates a dynamic view of morphology, i.e. a conception according to which &quot;words are generated each time anew&quot; (Hoekstra et al. 1980). This view contrasts with the static conception of morphology assumed in Jackendoff's word-based theory of the lexicon.</Paragraph> <Paragraph position="8"> Interestingly enough, with the exception of some (usually very small) systems with no morphology at all, all the lexicons in computational linguistic projects seem to assume a dynamic conception of morphology.</Paragraph> <Paragraph position="9"> The no-morphology option, which can be viewed as an extreme version of the word-based lexicon mentioned above, modulo the redundancy rules, has been adopted mostly for convenience by researchers working on parsers for languages fairly uninteresting from the point of view of morphology, e.g. English. It has the non-trivial merit of reducing lexical analysis to a simple dictionary look-up. Since all inflectional forms of a given word are listed independently, all the orthographic words must be present in the lexicon. Thus, this option presents the double advantage of being simple and efficient. The price to pay is fairly high, though, in the sense that the resulting lexicon displays an enormous amount of redundancy: lexical information relevant for a whole class of morphologically related words has to be duplicated for every member of the class. 
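To make the redundancy concrete, here is a minimal sketch of such a full-form lexicon in Python; the feature names and values are illustrative only, not taken from any actual system:

```python
# Toy full-form ("no morphology") lexicon: every orthographic word is an
# independent entry, so information shared by a whole paradigm is restated
# for each member. Feature names and values are illustrative only.
full_form_lexicon = {
    "run":     {"cat": "V", "subcat": "intransitive"},
    "runs":    {"cat": "V", "subcat": "intransitive"},
    "ran":     {"cat": "V", "subcat": "intransitive"},
    "running": {"cat": "V", "subcat": "intransitive"},
}

def lookup(word):
    # Lexical analysis reduces to a single dictionary look-up.
    return full_form_lexicon.get(word)

def update_paradigm(forms, feature, value):
    # Any update to class-wide information must touch every member.
    for form in forms:
        full_form_lexicon[form][feature] = value
```

Note that the shared properties are stated four times for one paradigm, and that a correction has to be repeated for every form.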
This duplication of information, in turn, makes the task of updating and/or deleting lexical entries much more complex than it should be.</Paragraph> <Paragraph position="10"> This option is more seriously flawed than just being redundant and space-greedy, though. By ignoring the obvious facts that words in natural languages do have some internal structure, that they may belong to declension or conjugation classes, and, above all, that different orthographic words may in fact realize the same grammatical word in different syntactic environments, it fails to be descriptively adequate. Interestingly enough, this inadequacy turns out to have serious consequences. Consider, for example, the case of a translation system. Because a lexicon of this exhaustive list type has no way of representing a notion such as &quot;lexeme&quot;, it lacks the proper level for lexical transfer. Thus, if been, was, were, am and be are treated as independent words, what should be their translation, say in French, especially if we assume that the French lexicon is organized on the same model? The point is straightforward: there is no way one can give translation equivalents for orthographic words. Lexical transfer can only be made at the more abstract level of the lexeme. The choice of a particular orthographic word to realize this lexeme is strictly language dependent. In the previous example, assuming that, say, were is to be translated as a form of the verb etre, the choice of the correct inflectional form will be governed by various factors and properties of the French sentence. 
In other words, a transfer lexicon must state the fact that the verb to be is translated in French by etre, rather than the lower level fact that under some circumstances were is translated by etaient.</Paragraph> <Paragraph position="11"> The problems caused by the size and the complexity of natural language lexicons, as well as the basic inadequacy of the &quot;no morphology&quot; option just described, have long been acknowledged by computational linguists, in particular by those involved in the development of large-scale application programs such as machine translation. It is thus hardly surprising that some version of the morpheme-based lexicon has been the option common to all large natural language systems.</Paragraph> <Paragraph position="12"> There is no doubt that restricting the lexicon to basic morphemes and deriving all complex words as well as all the inflected forms by morphological rules substantially reduces the size of the lexicon. This was indeed a crucial issue not so long ago, when computer memory was scarce and expensive.</Paragraph> <Paragraph position="13"> There are, however, numerous problems -- linguistic, computational as well as practical -- with the morpheme-based conception of the lexicon. Its inadequacy from a theoretical linguistic point of view has been discussed abundantly in the &quot;lexicalist&quot; literature. See in particular Chomsky (1970), Halle (1973) and Jackendoff (1975). Some of the linguistic problems are summarized below, along with some mentions of computational as well as practical problems inherent to this approach. First of all, from a conceptual point of view, the adoption of a derivational model of morphology suggests that the derivation of a word is very similar, as a process, to the derivation of a sentence. Such a view, however, fails to recognize some fundamental distinctions between the syntax of words and the syntax of sentences, for instance regarding creativity. 
Whereas the vast majority of the words we use are fixed expressions that we have heard before, exactly the opposite is true of sentences: most sentences we hear are likely to be novel to us.</Paragraph> <Paragraph position="14"> Also, given a morpheme-based lexicon, the morphological analysis creates readings of words that do not exist, such as strawberry understood as a compound of the morphemes straw and berry.</Paragraph> <Paragraph position="15"> This is far from being an isolated case; examples like the following are not hard to find: (5) a. comput-er  b. trans-mission  c. under-stand  d. re-ply  e. hard-ly The problem with these words is that they are morphologically composed of two or more morphemes, but their meaning is not derivable from the meaning of these morphemes. Notice that listing these words as such in the lexicon is not sufficient. The morphological analysis will still apply, creating an additional reading on the basis of the meaning of its parts. To block this process requires an ad hoc feature, i.e. a specific feature saying that this word should not be analysed any further.</Paragraph> <Paragraph position="16"> Generally speaking, the morpheme-based lexicon, along with its word formation rules, i.e. the rules that govern the combination of morphemes, is bound to generate far more words (or readings of words) than what really exists in a particular language. It is clearly the case that only a strict subset of the possible combinations of morphemes is actually realized. To put it differently, it confuses the notion of potential word for a language with the notion of actual word.</Paragraph> <Paragraph position="17"> This point was already noticed in Halle (1973), who suggested that in addition to the list of morphemes and the word formation rules which characterize the set of possible words, there must exist a list of actual words which functions as a filter on the output of word formation rules. 
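The over-generation problem and Halle's proposed filter can be illustrated with a toy segmenter; the morpheme and actual-word lists below are invented for the illustration and stand in for the real morpheme inventory and word formation rules:

```python
# Toy illustration of over-generation by a morpheme-based analyzer, and of
# a Halle-style filter: a list of actual words applied to the output of
# the word formation component. Morpheme and word lists are illustrative.
MORPHEMES = {"straw", "berry", "milk", "man", "moon", "light", "comput", "er"}

def segmentations(word):
    """Return every way of splitting word into known morphemes."""
    if word == "":
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in MORPHEMES:
            for rest in segmentations(word[i:]):
                results.append([prefix] + rest)
    return results

# Compounding over-generates: it produces potential words of the language.
ACTUAL_WORDS = {"strawberry", "milkman", "moonlight", "computer"}

def possible_compounds():
    return {a + b for a in MORPHEMES for b in MORPHEMES}

def filtered_compounds():
    # Only the strict subset of actual words survives the filter.
    return possible_compounds() & ACTUAL_WORDS
```

For instance, the rules happily generate a string like straw+man alongside strawberry; only the list of actual words separates the two.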
This filter, in other words, accounts for the difference between potential words and actual words.</Paragraph> <Paragraph position="18"> The idiosyncratic behaviour of lexical items has been further stressed in &quot;Remarks on Nominalization&quot;, where Chomsky convincingly argues that the meaning of derived nominals, such as those in (6), cannot be derived by rules from the meaning of their constitutive morphemes. Given the fact that derivational morphology is semantically irregular, it should not be handled in the syntax. Chomsky concludes that derived nominals must be listed as such in the lexicon, the relation between verb and nominals being captured by lexical redundancy rules.</Paragraph> <Paragraph position="19"> (6) a. revolve / revolution  b. marry / marriage  c. do / deed  d. act / action It should be noticed that the somewhat erratic and unpredictable morphological relations are not restricted to the domain of what is traditionally called derivation. As Halle points out (p. 6), the whole range of exceptional behaviour observed with derivation can be found with inflection. Halle gives examples of accidental gaps such as defective paradigms, phonological irregularity (accentuation of Russian nouns) and idiosyncratic meaning.</Paragraph> <Paragraph position="20"> From a computational point of view, a morpheme-based lexicon has few merits beyond the fact that it is comparatively small in size. In the generation process as well as in the analysis process, the lack of a clear distinction between possible and actual words makes it unreliable -- i.e. one can never be sure that its output is correct. Also, since a large number of morphological rules must systematically be applied to every single word to make sure that all possible readings of each word are taken into consideration, lexical analysis based on such a conception of the lexicon is bound to be fairly inefficient. 
Over the years, increasingly sophisticated morphological parsers have been designed, the best examples being Kay (1977), Karttunen (1983) and Koskenniemi (1983a,b), but not surprisingly, the efficiency of such systems remains well below that of simple dictionary look-up. Also, this model has the dubious property that the retrieval of an irregular form necessitates less computation than the retrieval of a regular form. This is so because, unlike regular forms that have to be created/analyzed each time they are used, irregular forms are listed as such in the lexicon. Hence, they can simply be looked up.</Paragraph> <Paragraph position="21"> This rapid and necessarily incomplete overview of the organization of the lexicon and the role of morphology in theoretical and computational linguistics has emphasized two basic types of requirements: the linguistic requirements, which have to do with the descriptive adequacy of the model, and the computational requirements, which have to do with the efficiency of the process of lexical analysis or generation. In particular, we argued that a lexicon consisting of the list of all the inflected forms without any morphology fails to meet the first requirement, i.e. linguistic adequacy.</Paragraph> <Paragraph position="22"> It was also pointed out that such a model lacks the abstract lexical level which is relevant, for instance, for lexical transfer in translation systems. Although clearly superior to what we called the &quot;no morphology&quot; system, the traditional morpheme-based model runs into numerous problems with respect to both linguistic and computational requirements.</Paragraph> <Paragraph position="23"> A third type of consideration, often overlooked in academic discussions but of primary importance for any &quot;real life&quot; system involving a large lexical data base, is what I would call &quot;practical requirements&quot;; it has to do with the complexity of the task of creating a lexical entry. 
It can roughly be viewed as a measure of the time it takes to create a new lexical entry, and of the amount of linguistic knowledge that is required to achieve this task. The relevance of these practical requirements becomes more and more evident as large natural language processing systems are being developed. For instance, a translation system -- or any other type of natural language processing program that must be able to handle very large amounts of text -- necessitates dictionaries of substantial size, of the order of at least tens of thousands of entries, perhaps even more than 100,000 lexical entries. Needless to say, the task of creating, as well as updating, such huge databases represents an enormous investment in terms of human resources which cannot be overestimated.</Paragraph> <Paragraph position="24"> Whether it takes an average of, say, 3 minutes to enter a new lexical entry or 30 minutes may not be all that important as long as we are considering lexicons of a few hundred words. It may be the difference between feasible and not feasible when it comes to very big databases.</Paragraph> <Paragraph position="25"> Another important practical issue is the level of linguistic knowledge that is required from the user. Systems which require little technical knowledge are to be preferred to those requiring an extensive amount of linguistic background, everything else being equal. 
It should be clear, in this respect, that morpheme-based lexicons tend to require more linguistic knowledge from the user than a word-based lexicon, since the user has to specify (i) what the morphological structure of the word is, (ii) to what extent the meaning of the word is or is not derived from the meaning of its parts, and (iii) what morphophonological rules apply in the derivation of this word.</Paragraph> <Paragraph position="26"> A RELATIONAL WORD-BASED LEXICON The traditional view in computational linguistics is to assume some version of the morpheme-based lexicon, coupled with a morphological analyzer/generator. Thus it is assumed that a dynamic morphological process takes place both in the analysis and in the generation of words (i.e. orthographic words). Each time a word is read or heard, it is decomposed into its atomic constituents, and each time it is produced it has to be re-created from its atomic constituents.</Paragraph> <Paragraph position="27"> As I pointed out earlier, I don't see any compelling evidence supporting this view other than the simplicity argument. Crucial for this argument, then, is the assumption that the complexity measure is just a measure of the length of the lexicon, i.e. the sum of the symbols contained in the lexicon.</Paragraph> <Paragraph position="28"> One cannot exclude, though, more sophisticated ways to measure the complexity of the lexicon. Jackendoff (1975:640) suggests an alternative complexity measure based on &quot;independent information content&quot;. Intuitively, the idea is that redundant information that is predictable by the existence of a redundancy rule does not count as independent.</Paragraph> <Paragraph position="29"> Assuming a strict lexicalist framework a la Jackendoff, we developed a word-based lexical database dubbed relational word-based lexicon (RWL). Essentially, the RWL model is a list-type lexicon with cross references. 
All the words of the language are listed in such a lexicon and have independent lexical entries. The morphological relations between two or more lexical entries are captured by a complex network of relations. The basic idea underlying this organization is to factor out properties shared by several lexical entries.</Paragraph> <Paragraph position="30"> To take a simple example, all the morphological forms of the English verb run have a lexical entry. Hence, run, runs, ran and running are listed independently in the lexicon. At the same time, however, these four lexical entries are to be related in some way to express the fact that they are morphologically related, i.e. they belong to the same paradigm. In turn, this has the further advantage of providing a clear definition of the &quot;lexeme&quot;, the abstract lexical unit which is relevant, for instance, for lexical transfer, as will be pointed out below.</Paragraph> <Paragraph position="31"> In contrast with common practice in computational linguistics, in this model morphology is essentially static. By interpreting morphology as relations within the lexical database rather than as a process, we shift some complexity from the parsing algorithm to the lexical data structures. Whether or not this shift is justified from a linguistic point of view is an open question, and I have nothing to say about it here. From a computational point of view, though, this shift has rather interesting consequences. First of all, it drastically simplifies the task of lexical analysis (or generation), making it a deterministic process, as opposed to the necessarily non-deterministic process of morphological parsing. In fact, it makes lexical analysis rather trivial, equating it with a fairly simple database query. It follows that the process of retrieving an irregular word is identical to the process of retrieving a regular word. 
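A minimal sketch of this kind of look-up, with Python dictionaries standing in for the database relations (the lexeme identifiers are invented for the illustration):

```python
# Word-based lexical look-up: regular and irregular forms are independent
# entries pointing to a shared lexeme, so retrieval is one deterministic
# query in either case. Lexeme identifiers are illustrative only.
words_to_lexeme = {
    "run": "RUN", "runs": "RUN", "ran": "RUN", "running": "RUN",
    "be": "BE", "was": "BE", "were": "BE",
}

lexemes = {
    "RUN": {"cat": "V", "gloss": "to run"},
    "BE":  {"cat": "V", "gloss": "to be"},
}

def analyze(word):
    # Lexical analysis is a simple query, not a morphological derivation.
    lexeme_id = words_to_lexeme.get(word)
    return None if lexeme_id is None else lexemes[lexeme_id]

def paradigm(lexeme_id):
    # The lexeme gives a direct handle on the whole paradigm.
    return sorted(w for w, lx in words_to_lexeme.items() if lx == lexeme_id)
```

Retrieving the irregular ran and the regular running goes through exactly the same query.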
The distinction between regular morphological forms and exceptional ones has no effect on the lexical analysis, i.e. on processing. Rather, it affects the complexity measure of the lexicon.</Paragraph> <Paragraph position="32"> Also, in sharp contrast to what happens with a derivational conception of morphology, in our model the morphological complexity of a language has very little effect on the efficiency of lexical analysis, which seems essentially correct: speakers of morphologically complex languages do not seem to require significantly more time to parse individual words than speakers of, say, English.</Paragraph> <Paragraph position="33"> A partial implementation of this relational word-based model of the lexicon has been realized for the parser for French described in Wehrli (1984). This section describes some of the features of this implementation. Only inflection has been implemented so far. Some aspects of derivational morphology should be added in the near future.</Paragraph> <Paragraph position="34"> In this implementation, lexical entries are composed of three distinct kinds of objects referred to as words, morpho-syntactic elements and lexemes, cf. figure 1. A word is simply a string of characters, or what is sometimes called an orthographic word. It is linked to a set of morpho-syntactic elements, each one of them specifying a particular grammatical reading of the word. A morpho-syntactic element is just a particular set of grammatical features such as category, gender, number, person, case, etc. A lexeme contains all the information shared by all the inflectional forms of a given lexical item. The lexeme is defined as a set of syntactic and semantic features shared by one or several morpho-syntactic elements. Roughly speaking, it contains the kind of information one expects to find in a standard dictionary entry.</Paragraph> <Paragraph position="36"> [Figure 1: structure of the lexical data base. Words are linked to morpho-syntactic elements (e.g. V, past part.; N, sg.; V, inf.; V, 1st pl. pres.), which are in turn linked to lexemes.]</Paragraph> <Paragraph position="37"> In relational terms, fully-specified lexical entries are broken into three different relations. The full set of information belonging to a lexical entry can be obtained by intersecting the three relations.</Paragraph> <Paragraph position="38"> The following example illustrates the structure of the lexical data base and the respective roles of words, morpho-syntactic elements and lexemes. In French, suis is ambiguous. It is the first person singular present tense of the verb etre ('to be'), which, as in English, is both a verb and an auxiliary. But suis is also the first and second person singular present tense of the verb suivre ('to follow'). This information is represented as follows: the lexicon has a word (in the technical sense, i.e. a string of characters) suis associated with two morpho-syntactic elements. The first morpho-syntactic element, which bears the features [+V, 1st, sg, present], is linked to a list of two lexemes. One of them contains all the general properties of the verb etre, the other one the information corresponding to the auxiliary reading of etre. As for the second morpho-syntactic element, it bears the features [+V, 1st-2nd, sg, present] and it is related to the lexeme containing the syntactic and semantic features characterizing the verb suivre.</Paragraph> <Paragraph position="39"> Such an organization allows for a substantial reduction of redundancy. All the different morphological forms of etre, i.e. over 25 different words, are ultimately linked to 2 lexemes (verbal and auxiliary readings). Thus, information about subcategorization, selectional restrictions, etc. is specified only once rather than 25 times or more. Naturally, this concentration of the information also simplifies the updating procedure. 
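The suis example can be rendered as a small three-relation sketch, with Python dictionaries standing in for the database relations; the identifiers (MS1, ETRE_V, etc.) are invented for the illustration:

```python
# Sketch of the three relations for French "suis":
# words -> morpho-syntactic elements -> lexemes.
# All identifiers are invented for this illustration.
words = {
    "suis": ["MS1", "MS2"],
}

morpho_syntactic = {
    # first person singular present: etre (verb and auxiliary readings)
    "MS1": {"features": {"cat": "V", "person": "1", "number": "sg",
                         "tense": "present"},
            "lexemes": ["ETRE_V", "ETRE_AUX"]},
    # first/second person singular present: suivre
    "MS2": {"features": {"cat": "V", "person": "1-2", "number": "sg",
                         "tense": "present"},
            "lexemes": ["SUIVRE_V"]},
}

lexemes = {
    "ETRE_V":   {"translation": "to be"},
    "ETRE_AUX": {"translation": "auxiliary be"},
    "SUIVRE_V": {"translation": "to follow"},
}

def readings(word):
    # Intersecting the three relations yields the full lexical entry.
    return [(morpho_syntactic[ms]["features"], lexemes[lx])
            for ms in words.get(word, [])
            for lx in morpho_syntactic[ms]["lexemes"]]
```

A single word thus yields three readings, while translation information is stated once per lexeme rather than once per inflected form.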
Also, as we pointed out above, this structure provides a clear definition of &quot;lexeme&quot;, the abstract lexical representation, which is the level of representation relevant for transfer in translation systems.</Paragraph> <Paragraph position="40"> Figure 1, above, illustrates the structure of the lexical database. Boxes stand for the different items (words, morphosyntactic elements, lexemes) and arrows represent the relations between these items. Notice that not all morphosyntactic elements are associated with some lexemes. In fact, there is a lexeme level only for those categories which display morphological variation, i.e. nouns, adjectives, verbs and determiners.</Paragraph> <Paragraph position="41"> The arrow between the words est and est-ce que expresses the fact that the string est occurs in initial position in the compound est-ce que. This is the way compounds are dealt with in this lexicon. The compound clair de lune ('moonlight') is listed as an independent word -- along with its associated morphosyntactic elements and lexemes -- related to the word clair. The function of this relation is to signal to the analyzer that the word clair is also the first segment of a compound.</Paragraph> <Paragraph position="42"> Consider the vertical arrow between the lexeme corresponding to the verbal reading of etre ('to be') and the lexeme corresponding to the auxiliary reading of etre. It expresses the fact that a given morphosyntactic element may have several distinct readings (in this case the verbal reading and the auxiliary reading). Thus, morphosyntactic elements can be related not just to one lexeme, but to a list of lexemes.</Paragraph> <Paragraph position="43"> The role of morphology in Jackendoff's system is twofold. First, the redundancy rules have a static role, which is to describe morphological patterns in the language, and thus to account for word-structure. 
In addition to this primary role, morphology also assumes a secondary role, in the sense that it can be used to produce new words or to analyze words that are not present in the lexicon. In this respect, Jackendoff (1975:668) notes, &quot;lexical redundancy rules are learned from generalizations observed in already known lexical items. Once learned, they make it easier to learn new lexical items&quot;. In other words, redundancy rules can also function as word formation rules and, hence, have a dynamic function. In our implementation of the relational word-based lexicon, morphology also has a double function. On the one hand, morphological relations are embedded in the structure of the database itself and, roughly, correspond to Jackendoff's redundancy rules in their static role. On the other hand, morphological rules are considered as &quot;learning rules&quot;, i.e. as devices which facilitate the acquisition of the paradigm of the inflected forms of a new lexeme. As such, morphological rules apply when a new word is entered in the lexicon. Their role is to assist the user in the task of entering new lexical entries. For example, if the infinitival form of a verb is entered, the morphological rules are used to create all the inflected forms, in an interactive session. So, for instance, the system first hypothesizes that the verb is morphologically regular. If the user confirms this hypothesis, the system generates all the inflected forms without further assistance. If the answer is no, the system will try another hypothesis, looking for subregularities.</Paragraph> <Paragraph position="44"> Our relational word-based lexicon was first implemented on a relational database system on a VAX-780. However, for efficiency reasons, it was transferred to a more conventional system using indexed sequential and direct access files. 
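The interactive acquisition procedure just described (regular hypothesis first, user confirmation, fallback to subregularities) might be sketched as follows for regular French -er verbs; the endings and control flow are illustrative, restricted here to the present indicative, and are not the actual system:

```python
# Toy sketch of morphological "learning rules": given an infinitive,
# hypothesize a regular -er paradigm and ask the user to confirm it.
# Endings cover the present indicative only; everything is illustrative.
PRESENT_ER = {"1sg": "e", "2sg": "es", "3sg": "e",
              "1pl": "ons", "2pl": "ez", "3pl": "ent"}

def regular_er_present(infinitive):
    if not infinitive.endswith("er"):
        raise ValueError("not an -er infinitive")
    stem = infinitive[:-2]
    return {slot: stem + ending for slot, ending in PRESENT_ER.items()}

def acquire(infinitive, confirm):
    """confirm(forms) -> bool stands in for the interactive session."""
    forms = regular_er_present(infinitive)
    if confirm(forms):
        return forms
    return None  # the real system would try subregular hypotheses next
```

In the real system, a confirmed hypothesis yields the whole paradigm of new word entries without further assistance from the user.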
In its present implementation, on a VAX-750, words and morphosyntactic elements are stored in indexed sequential files, lexemes in direct access files. In other words, the lexicon is entirely stored in external files, which can be expanded practically without affecting the efficiency of the system. A set of menu-oriented procedures allows the user to interact with the lexical data base, to either insert, delete, update or just visualize words and their lexical specifications.</Paragraph> </Section> </Paper>