<?xml version="1.0" standalone="yes"?>
<Paper uid="J87-3002">
  <Title>LARGE LEXICONS FOR NATURAL LANGUAGE PROCESSING: UTILISING THE GRAMMAR CODING SYSTEM OF LDOCE</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 THE ACCESS ENVIRONMENT
</SectionTitle>
    <Paragraph position="0"> There is a well recognised problem with providing computational support for machine readable dictionaries, in particular where issues of access are concerned. On the one hand, dictionaries exhibit far too much structure for conventional techniques for managing 'flat' text to apply to them. On the other hand, the equally large amounts of free text in dictionary entries, as well as the implicitly marked relationships commonly used to encode linguistic information, make a dictionary difficult to represent as a structured database of a standard, eg. relational, type. In addition, in order to link the machine readable version of LDOCE to our development environment, and eventually to our natural language processing systems, we need to provide fast access from Lisp to data held in secondary storage.</Paragraph>
    <Paragraph position="1"> Lisp is not particularly well suited for interfacing to complex, structured objects, and it was not our intention to embark on a major effort involving the development of a formal model of a dictionary (of the style described in, eg., Tompa 1986); on the other hand a method of access was clearly required, which was flexible enough to support a range of applications intending to make use of the LDOCE tape.</Paragraph>
    <Paragraph position="2"> The requirement for having the dictionary entries in a form convenient for symbolic manipulation from within Lisp was furthermore augmented by the constraint that all the information present in the typesetting tape should be carried over to the on-line version of LDOCE, since it is impossible to say in advance which records and fields of an entry would, or would not, be of potential use to a natural language processing program. Finally, the complexity of the data structures stored on disc should not be constrained in any way by the method of access, as we do not have a very clear idea what form the restructured dictionary may eventually take.</Paragraph>
    <Paragraph position="3"> Given that we were targeting all envisaged access routes from LDOCE to systems implemented in Lisp, and since the natural data structure for Lisp is the s-expression, we adopted the approach of converting the tape source into a set of list structures, one per entry. Our task was made possible by the fact that while far from being a database in the accepted sense of the word, the LDOCE typesetting tape is the only truly computerised dictionary of English (Michiels, 1983).</Paragraph>
    <Paragraph position="4"> 204 Computational Linguistics, Volume 13, Numbers 3-4, July-December 1987 Bran Boguraev and Ted Briscoe Large Lexicons for Natural Language Processing The logical structure of a dictionary entry is reflected on the tape as a sequence of typed records (see Figure 1), each with additional internal segmentation, where records and fields correspond to separate units in an entry, such as headword, pronunciation, grammar code, word senses, and so forth.</Paragraph>
    <Paragraph position="5">  The &amp;quot;lispification&amp;quot; of the typesetting tape was carfled out in a series of batch jobs, via a program written in a general text editing facility. The need to carry out the conversion without any loss of information meant that special attention had to be paid to the large number of non-printing characters which appear on the tape.</Paragraph>
    <Paragraph position="6"> Most of these signal changes in the typographic appearance of the printed dictionary, where crucial information about the structure of an entry is represented by changes of typeface and font size. All control characters were translated into atoms of the form *AB, where A and B correspond to the hexadecimal digits of the ASCII character code. Information was thus preserved, and readily available to any program which needed to parse the implicit structure of a dictionary entry or field, and the lispified source was made suitable for transporting between different software configurations and operating systems. Figure 2 illustrates part of an entry as it appears in the published dictionary, on the typesetting tape and after lispification.</Paragraph>
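The *AB translation just described can be sketched as follows. This is an illustrative reconstruction in Python (the original conversion was carried out with a general text-editing facility, not Python), and the sample byte values are invented:

```python
def lispify_control_chars(raw: bytes) -> str:
    """Translate each non-printing byte into an atom *AB, where A and B
    are the hexadecimal digits of the character code, as described above."""
    out = []
    for b in raw:
        if 32 <= b < 127:            # printable ASCII passes through
            out.append(chr(b))
        else:                        # control character -> *AB atom
            out.append(" *%02X " % b)
    return "".join(out)
```

Because the translation is one atom per byte, it is trivially reversible, which is what preserves the typographic information for later parsing.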
    <Paragraph position="7"> Note that as a result of the lispification, brackets have been inserted at suitable points, both to delimit entries and indicate their internal structure; in addition characters special to Lisp have been appropriately escaped. Thus an individual dictionary entry can now be made available to a client program by a single call to a generic read function, once the Lisp reader has been properly positioned and 'aligned' with the beginning of</Paragraph>
    <Paragraph position="8">  the s-expression encoding the required entry. In the lispified entry in Figure 2 the numbers at the head of each sublist indicate the type of information stored in each field within the overall entry. For example, &amp;quot;5&amp;quot; is the part of speech field, and &amp;quot;8&amp;quot; is the word sense definition.</Paragraph>
    <Paragraph position="9"> The 60,000 or so complete entries of the processed dictionary require of the order of 20 MBytes to store.</Paragraph>
    <Paragraph position="10"> The problem of access, from Lisp, to the dictionary entry s-expressions held on secondary storage cannot be resolved by ad hoc solutions, such as sequential scanning of files on disc or extracting subsets of such files which will fit in main memory, as these are not adequate as an efficient interface to a parser. (Exactly the same problem would occur if our natural language systems were implemented in Prolog, since the Prolog 'database facility' refers to the knowledge base that Prolog maintains in main memory.) In principle, given that the dictionary is now in a Lisp-readable format, a powerful virtual memory system might be able to manage access to the internal Lisp structures resulting from reading the entire dictionary; we have, however, adopted an alternative solution as outlined below.</Paragraph>
    <Paragraph position="11"> We have mounted LDOCE on-line under two different hardware configurations. In both cases the same lispified form of the dictionary has been converted into a random access file, paired together with an indexing file from which the disc addresses of dictionary entries for words and compounds can be computed.</Paragraph>
    <Paragraph position="12"> A series of systems in Cambridge are implemented in Lisp running under Unix. They all make use of an efficient dictionary access system which services requests for s-expression entries made by client programs. A dictionary access process is fired off, which dynamically constructs a search tree and navigates through it from a given homograph directly to the offset in the lispified file from where all the associated information can be retrieved. As Alshawi (1987) points out, given that no situations were envisaged where the information from the tape would be altered once installed in secondary storage, this simple and conventional access strategy is perfectly adequate. The use of such standard database indexing techniques makes it possible for an active dictionary process to be very undemanding with respect to main memory utilisation.</Paragraph>
    <Paragraph position="13"> For reasons of efficiency and flexibility of customisation, namely the use of LDOCE by different client programs and from different Lisp and/or Prolog systems, the dictionary access system is implemented in the programming language C and makes use of the inter-process communication facilities provided by the Unix operating system. To the Lisp programmer, the creation of a dictionary process and subsequent requests for information from the dictionary appear simply as Lisp function calls.</Paragraph>
    <Paragraph position="14"> Most of the recent work with the dictionary, and in particular the decompacting and analysis of the grammar codes, has been carried out in Interlisp-D on Xerox 1100 series workstations. The same lispified form of the dictionary was used. Originally it was installed on a single workstation and only available locally. Instead of a separate process building a search tree, the access method relies on a precompiled, multilevel indexing structure which allows direct hashing into the on-line source. In addition, the powerful Interlisp-D virtual memory allows the access system to be significantly enhanced by caching, in main memory, most of the working subset of the dictionary at any given time. It turns out that for a single-user workstation specially tuned for Lisp, with operations optimised at the microcode level for random file access and s-expression I/O, this strategy offers remarkably good results.</Paragraph>
    <Paragraph position="15"> More recently, a dictionary server, of the kind described by Kay (1984b), was implemented and installed as a background process on a Xerox workstation networked together with the rest of the equipment dedicated to natural language processing applications (Boguraev et al., 1987). Again, the same lispified form of the machine readable source of LDOCE was used. From the point of view of providing a centralised service to more than one client, efficiently over a packet switching network, disc space on the server processor was not an issue. This made it possible to construct a larger, but more comprehensive, index for the dictionary, which now allows the recovery of a word in guaranteed time (typically less than a second).</Paragraph>
    <Paragraph position="16"> The main access route into LDOCE for most of our current applications is via the homograph fields (see Figure 1). Options exist in the access software to specify which particular homograph (or homographs) for a lexical item is required. The early process of lispification was designed to bring together in a single group all dictionary entries corresponding not only to different homographs, but also to lexicalised compounds for which the argument word appears as the head of the compound. Thus, the primary index for blow allows access to two different verb homographs (eg. blow3), two different noun homographs (eg. blow2), 10 compounds (eg. blow off and blow-by-blow), or all 14 of the dictionary entries (not necessarily to be found in subsequent positions in the dictionary) related to blow.</Paragraph>
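The grouped primary index can be sketched as below. The group structure and the particular homograph labels are hypothetical (the text specifies only that blow groups two verb homographs, two noun homographs, and 10 compounds); a real index would hold file offsets rather than entry names, and only a fragment of the blow group is shown:

```python
# Hypothetical grouped index: entry names stand in for file offsets.
GROUPS = {
    "blow": {
        "verb": ["blow1", "blow3"],
        "noun": ["blow2", "blow4"],
        "compound": ["blow off", "blow-by-blow"],
    }
}

def lookup(word, kind=None):
    """Return one class of entries for a word, or every related entry."""
    group = GROUPS[word]
    if kind is None:                      # all homographs and compounds
        return [entry for entries in group.values() for entry in entries]
    return group[kind]
```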
    <Paragraph position="17"> While no application currently makes use of this facility, the motivation for such an approach to dictionary access comes from envisaging a parser which will operate on the basis of the on-line LDOCE; and any serious parser must be able to recognise compounds before it segments its input into separate words.</Paragraph>
    <Paragraph position="18"> From the master LDOCE file, we have computed alternative indexing information, which allows access into the dictionary via different routes. In addition to headwords, dictionary search through the pronunciation field is available; Carter (1987) has merged information from the pronunciation and hyphenation fields, creating an enhanced phonological representation which allows access to entries by broad phonetic class and syllable structure (Huttenlocher and Zue, 1983). In addition, a fully flexible access system allows the retrieval of dictionary entries on the basis of constraints specifying any combination of phonetic, lexical, syntactic, and semantic information (Boguraev et al., 1987). Independently, random selection of dictionary entries is also provided to allow the testing of software on an unbiased sample.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 THE FORMAT OF THE GRAMMAR CODES
</SectionTitle>
    <Paragraph position="0"> The lispified LDOCE file retains the broad structure of the typesetting tape and divides each entry into a number of fields -- head word, pronunciation, grammar codes, definitions, examples, and so forth. However, each of these fields requires further decoding and restructuring to provide client programs with easy access to the information they require (see Calzolari (1984) for further discussion). For this purpose the formatting codes on the typesetting tape are crucial since they provide clues to the correct structure of this information. For example, word senses are largely defined in terms of the 2000 word core vocabulary; in some cases, however, other words (themselves defined elsewhere in terms of this vocabulary) are used. These words always appear in small capitals and can therefore be recognised because they will be preceded by a font change control character. In Figure 1 above the definition of rivet as verb includes the noun definition of &amp;quot;RIVET1&amp;quot;, as signalled by the font change and the numerical superscript which indicates that it is the first (i.e. noun entry) homograph; additional notation exists for word senses within homographs. On the typesetting tape, font control characters are indicated by hexadecimal numbers within curly brackets. In addition, there is a further complication because this sense is used in the plural and the plural morpheme must be removed before RIVET can be associated with a dictionary entry.</Paragraph>
    <Paragraph position="1"> However, the restructuring program can achieve this because such morphology is always italicised, so the program knows that, in the context of non-core vocabulary items, the italic font control character signals the occurrence of a morphological variant of an LDOCE head entry.</Paragraph>
    <Paragraph position="2"> A suite of programs to unscramble and restructure all the fields in LDOCE entries has been written which is capable of decoding all the fields except those providing cross-reference and usage information for complete homographs. Figure 3 illustrates a simple lexical entry before and after the application of these programs. The development of the restructuring programs was a non-trivial task because the organisation of information on the typesetting tape presupposes its visual presentation, and the ability of human users to apply common sense, utilise basic morphological knowledge, ignore minor notational inconsistencies, and so forth. To provide a test-bed for these programs we have implemented an interactive dictionary browser capable of displaying the restructured information in a variety of ways and representing it in perspicuous and expanded form.</Paragraph>
    <Paragraph position="3"> In what follows we will discuss the format of the grammar codes in some detail, as they are the focus of the current paper; the reader should bear in mind, however, that they represent only one comparatively constrained field of an LDOCE entry and therefore a small proportion of the overall restructuring task. Figure 4 illustrates the grammar code field for the third word sense of the verb believe as it appears in the published dictionary, on the typesetting tape and after restructuring. believe v ... 3 \[T5a,b;V3;X(to be)1,(to be)7\] (7 300 !&lt; T5a !, b !; V3 !; X (*46 to be *44) 1 !, (*46 to be *44) 7 !&lt; ) sense-no 3 head: T5a head: T5b head: V3 head: X1 right optional (to be) head: X7 right optional (to be) Figure 4 LDOCE provides considerably more syntactic information than a traditional dictionary. The Longman lexicographers have developed a grammar coding system capable of representing in compact form a non-trivial amount of information, usually to be found only in large descriptive grammars of English (such as Quirk et al., 1985). A grammar code describes a particular pattern of behaviour of a word. Patterns are descriptive, and are used to convey a range of information: eg.</Paragraph>
    <Paragraph position="4"> distinctions between count and mass nouns (dog vs.</Paragraph>
    <Paragraph position="5"> desire), predicative, postpositive and attributive adjectives (asleep vs. elect vs. jocular), noun complementation (fondness, fact) and, most importantly, verb complementation and valency.</Paragraph>
    <Paragraph position="6"> Grammar codes typically contain a capital letter, followed by a number and, occasionally, a small letter, for example \[T5a\] or \[V3\]. The capital letters encode information &amp;quot;about the way a word works in a sentence or about the position it can fill&amp;quot; (Procter, 1978: xxviii); the numbers &amp;quot;give information about the way the rest of a phrase or clause is made up in relation to the word described&amp;quot; (ibid.). For example, &amp;quot;T&amp;quot; denotes a transitive verb with one object, while &amp;quot;5&amp;quot; specifies that what follows the verb must be a sentential complement introduced by that. (The small letters, eg. &amp;quot;a&amp;quot; in the case above, provide further information typically related to the status of various complementisers, adverbs and prepositions in compound verb constructions: eg.</Paragraph>
    <Paragraph position="7"> &amp;quot;a&amp;quot; indicates that the word that can be left out between a verb and the following clause.) As another example, &amp;quot;V3&amp;quot; introduces a verb followed by one NP object and a verb form (V) which must be an infinitive with to (3). In addition, codes can be qualified with words or phrases which provide further information concerning the linguistic context in which the described item is likely, and able, to occur; for example \[Dl(to)\] or \[L(to be)l\]. Sets of codes, separated by semicolons, are associated with individual word senses in the lexical entry for a particular item, as Figure 5 illustrates. These sets are elided and abbreviated in the code field associated with the word sense to save space. Partial codes sharing an initial letter can be separated by commas, for example \[T1,5a\]. Word qualifiers relating to a complete sequence of codes can occur at the end of a code field, delimited by a colon, for example \[T1 ;I0: (DOWN)\].</Paragraph>
    <Paragraph position="8"> Codes which are relevant to all the word senses in an entry often occur in a separate field after the head word and occasionally codes are elided from this field down into code fields associated with each word sense as, for example, in Figure 6. Decompacting and restructuring grammar code entries into a format more suitable for further automated analysis can be done with knowledge of the syntax of the grammar code system and the significance of punctuation and font changes. However, discovering the syntax of the system is difficult since no explicit description is available from Longman and the code is geared more towards visual presentation than formal precision; for example, words which qualify codes, such as &amp;quot;to be&amp;quot; in Figure 4, appear in italics and therefore, will be preceded by the font control character *45. But sometimes the thin space control character *64 also appears; the insertion of this code is based solely on visual criteria, rather than the informational structure of the dictionary. Similarly, choice of font can be varied for reasons of appearance and occasionally information normally associated with one field of an entry is shifted into another to create a more compact or elegant printed entry.</Paragraph>
    <Paragraph position="9"> Figure 5. (The entry for the verb feel, whose nine word senses each carry their own grammar code field.)</Paragraph>
    <Paragraph position="10"> Figure 6. (The entry for see off, whose \[T1\] code, relevant to both word senses, appears once after the head word, with \[(at)\] elided down into the first sense.)</Paragraph>
    <Paragraph position="11"> In addition to the 'noise' generated by the fact that we are working with a typesetting tape geared to visual presentation, rather than a database, there are errors and inconsistencies in the use of the grammar code system. Examples of errors, illustrated in Figure 7, include the code for the noun promise which contains a misplaced comma, that for the verb scream, in which a colon delimiter occurs before the end of the field, and that for the verb like where a grammatical label occurs inside a code field.</Paragraph>
    <Paragraph position="12"> p,o,-i.e, ... X \[C(of),C3.S; scream v ... 3 \[T1,5; (OUT); I0\] like v ... 2 \[T3,4; ne9.\]  In addition, inconsistencies occur in the application of the code system by different lexicographers. For example, when codes containing &amp;quot;to be&amp;quot; are elided they mostly occur as illustrated in Figure 4 above.</Paragraph>
    <Paragraph position="13"> However, sometimes this is represented as \[L(to be)1,9\]. Presumably this kind of inconsistency arose because one member of the team of lexicographers realised that this form of elision saved more space.</Paragraph>
    <Paragraph position="14"> This type of error and inconsistency arises because grammatical codes are constructed by hand and no automatic checking procedure is attempted (see Michiels, 1982, for further comment). One approach to this problem is that taken by the ASCOT project (Akkerman et al., 1985; Akkerman, 1986). In this project, a new lexicon is being manually derived from LDOCE. The coding system for the new lexicon is a slightly modified and simplified version of the LDOCE scheme, without any loss of generalisation and expressive power. More importantly, the assignment of codes for problematic or erroneously labelled words is being corrected in an attempt to make the resulting lexicon more appropriate for automated analysis. In the medium term this approach, though time consuming, will be of some utility for producing more reliable lexicons for natural language processing.</Paragraph>
    <Paragraph position="15"> However, in the short term, the necessity to cope with such errors provides much of the motivation for our interactive approach to lexicon development, since this allows the restructuring programs to be progressively refined as these problems emerge. Any attempt at batch processing without extensive initial testing of this kind would inevitably result in an incomplete and possibly inaccurate lexicon.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 THE CONTENT OF THE GRAMMAR CODES
</SectionTitle>
    <Paragraph position="0"> Once the grammar codes have been restructured, it still remains to be shown that the information they encode is going to be of some utility for natural language processing. The grammar code system used in LDOCE is based quite closely on the descriptive grammatical framework of Quirk et al. (1972, 1985). The codes are doubly articulated; capital letters represent the grammatical relations which hold between a verb and its arguments and numbers represent subcategorisation frames which a verb can appear in. Most of the subcategorisation frames are specified by syntactic category, but some are very ill-specified; for instance, 9 is defined as &amp;quot;needs a descriptive word or phrase&amp;quot;. In practice many adverbial and predicative complements will satisfy this code, when attached to a verb; for example, put \[X9\], where the code marks a locative adverbial prepositional phrase, vs. make under sense 14 (hereafter written make(14)), coded \[X9\], where it marks a predicative noun phrase or prepositional phrase.</Paragraph>
    <Paragraph position="1"> The criteria for the assignment of capital letters to verbs are not made explicit, but are influenced by the syntactic and semantic relations which hold between the verb and its arguments; for example, I5, L5 and T5 can all be assigned to verbs which take an NP subject and a sentential complement, but L5 will only be assigned if there is a fairly close semantic link between the two arguments and T5 will be used in preference to I5 if the verb is felt to be semantically two place rather than one place, such as know versus appear. On the other hand, both believe and promise are assigned V3, which means they take an NP object and an infinitival complement, yet there is a similar semantic distinction to be made between the two verbs; so the criteria for the assignment of the V code seem to be purely syntactic.</Paragraph>
    <Paragraph position="2"> Michiels (1982) and Akkerman et al. (1985) provide a more detailed analysis of the information encoded by the LDOCE grammar codes and discuss their efficacy as a system of linguistic description. Ingria (1984) comprehensively compares different approaches to complementation within grammatical theory providing a touchstone against which the LDOCE scheme can be evaluated.</Paragraph>
    <Paragraph position="3"> Most automated parsing systems employ grammars which carefully distinguish syntactic and semantic information, therefore, if the information provided by the Longman grammar code system is to be of use, we need to be able to separate out this information and map it into a representation scheme compatible with the type of lexicon used by such parsing systems.</Paragraph>
    <Paragraph position="4"> The program which transforms the LDOCE grammar codes into lexical entries utilisable by a parser takes as input the decompacted codes and produces a relatively theory neutral representation of the lexical entry for a particular word, in the sense that this representation could be further transformed into a format suitable for most current parsing systems. For example, if the input were the third sense of believe, as in Figure 4, the program would generate the (partial) entry shown in Figure 8 below. The four parts correspond to different syntactic realisations of the third sense of the verb believe. Takes indicates the syntactic category of the subject and complements required for a particular realisation. Type indicates aspects of logical semantics discussed below.</Paragraph>
    <Paragraph position="5">  At the time of writing, rules for producing adequate entries to drive a parsing system have only been developed for verb codes. In what follows we will describe the overall transformation strategy and the particular rules we have developed for the verb codes. Extending the system to handle nouns, adjectives and adverbs would present no problems of principle. However, the LDOCE coding of verbs is more comprehensive than elsewhere, so verbs are the obvious place to start in an evaluation of the usefulness of the coding system. No attempt has been made to map any closed class entries from LDOCE, as a 3,000 word lexicon containing most closed class items has been developed independently by one of the groups collaborating with us to develop the general purpose morphological and syntactic analyser (see the Introduction and Russell et al., 1986).</Paragraph>
    <Paragraph position="6"> Initially the transformation of the LDOCE codes was performed on a code-by-code basis, within a code field associated with each individual word sense. This approach is adequate if all that is required is an indication of the subcategorisation frames relevant to any particular sense. In the main, the code numbers determine a unique subcategorisation. Thus the entries can be used to select the appropriate VP rules from the grammar (assuming a GPSG-style approach to subcategorisation) and the relevant word senses of a verb in a particular grammatical context can be determined. However, if the parsing system is intended to produce a representation of the predicate-argument structure for input sentences, then this simple approach is inadequate because the individual codes only give partial indications of the semantic nature of the relevant sense of the verb.</Paragraph>
    <Paragraph position="7"> The solution we have adopted is to derive a semantic classification of the particular sense of the verb under consideration on the basis of the complete set of codes assigned to that sense. In any subcategorisation frame which involves a predicate complement there will be a non-transparent relationship between the superficial syntactic form and the underlying logical relations in the sentence. In these situations the parser can use the semantic type of the verb to compute this relationship.</Paragraph>
    <Paragraph position="8"> Expanding on a suggestion of Michiels (1982), we classify verbs as Subject Equi, Object Equi, Subject Raising or Object Raising for each sense which has a predicate complement code associated with it. These terms, which derive from Transformational Grammar, are used as convenient labels for what we regard as a semantic distinction; the actual output of the program is a specification of the mapping from superficial syntactic form to an underlying logical representation. For example, labelling believe(3) (Type 2 ORaising) indicates that this is a two place predicate and that, if believe(3) occurs with a syntactic direct object, as in (1) John believes the Earth to be round, it will function as the logical subject of the predicate complement. Michiels proposed rules for doing this for infinitive complement codes; however there seems to be no principled reason not to extend this approach to computing the underlying relations in other types of VP as well as in cases of NP, AP and PP predication (see Williams (1980), for further discussion).</Paragraph>
    <Paragraph position="9"> The five rules which are applied to the grammar codes associated with a verb sense are ordered in a way which reflects the filtering of the verb sense through a series of syntactic tests. Verb senses with an \[it + I5\] code are classified as Subject Raising. Next, verb senses which contain a \[V\] or \[X\] code and one of \[D5\], \[D5a\], \[D6\] or \[D6a\] codes are classified as Object Equi. Then, verb senses which contain a \[V\] or \[X\] code and a \[T5\] or \[T5a\] code in the associated grammar code field, (but none of the D codes mentioned above), are classified as Object Raising. Verb senses with a \[V\] or \[X(to be)\] code, (but no \[T5\] or \[T5a\] codes), are classified as Object Equi. Finally, verb senses containing a \[T2\], \[T3\] or \[T4\] code, or an \[I2\], \[I3\] or \[I4\] code are classified as Subject Equi. Figure 9 gives examples of each type.</Paragraph>
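A direct sketch of these five ordered rules, with each code set simplified to plain strings (the \[it + I5\] code is represented here as the single token "it+I5", an encoding of our own choosing):

```python
def classify(codes):
    """Apply the five ordered rules to the set of grammar codes for
    one verb sense; the first rule that fires wins."""
    codes = set(codes)
    has_v_or_x = any(c.startswith(("V", "X")) for c in codes)
    if "it+I5" in codes:                                      # rule 1
        return "Subject Raising"
    if has_v_or_x and codes & {"D5", "D5a", "D6", "D6a"}:     # rule 2
        return "Object Equi"
    if has_v_or_x and codes & {"T5", "T5a"}:                  # rule 3
        return "Object Raising"
    if any(c.startswith(("V", "X(to be)")) for c in codes):   # rule 4
        return "Object Equi"
    if codes & {"T2", "T3", "T4", "I2", "I3", "I4"}:          # rule 5
        return "Subject Equi"
    return None                                               # no rule applies
```

On the codes of believe(3) from Figure 4 (T5a, V3, X(to be)1), rule 3 fires, yielding the Object Raising classification used in the examples above.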
    <Paragraph position="10"> The Object Raising and Object Equi rules attempt to exploit the variation in transformational potential between Raising and Equi verbs. Firstly, a Raising verb such as believe, unlike an Equi verb such as force, allows a finite sentential complement, as the paradigm in (2) and (3) illustrates.  (2) John believes that the Earth is round.</Paragraph>
    <Paragraph position="11"> (3) *John forces that the Earth is round.</Paragraph>
    <Paragraph position="12"> Secondly, if a verb takes a direct object and a sentential complement, it will be an Equi verb, as examples in (4) and (5) illustrate.</Paragraph>
    <Paragraph position="13"> (4) John persuaded Mary that the Earth is round.</Paragraph>
    <Paragraph position="14"> (5) *John believed Mary that the Earth is round.</Paragraph>
    <Paragraph position="15">  Clearly, there are other syntactic and semantic tests for this distinction (see, e.g., Perlmutter and Soames, 1979:472), but these are the only ones which are explicit in the LDOCE coding system.</Paragraph>
    <Paragraph position="16"> Once the semantic type for a verb sense has been determined, the sequence of codes in the associated code field is translated, as before, on a code-by-code basis. However, when a predicate complement code is encountered, the semantic type is used to determine the type assignment, as illustrated in Figures 4 and 8 above. Where no predicate complement is involved, the letter code is usually sufficient to determine the logical properties of the verb involved. For example, T codes nearly always translate into two-place predicates as Figure 10 illustrates.</Paragraph>
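The code-by-code translation step can be sketched as below. The arity table and the set of predicate complement codes are our own simplifying assumptions in the spirit of Figure 10 (T codes nearly always yield two-place predicates), not a transcription of the authors' program.

```python
# Assumed sets and tables, loosely modelled on the pattern of Figure 10.
PRED_COMP_CODES = {"V", "X", "T5", "T5a"}   # hypothetical selection
ARITY = {"I": 1, "T": 2, "D": 3}            # letter code -> predicate arity

def translate_codes(codes, semantic_type=None):
    """Map each grammar code to an (arity, semantic type) pair."""
    translated = []
    for code in codes:
        if code in PRED_COMP_CODES and semantic_type is not None:
            # Predicate complement codes take their logical type from
            # the Equi/Raising classification determined beforehand.
            translated.append((2, semantic_type))
        else:
            # Otherwise the letter code alone usually suffices.
            translated.append((ARITY.get(code[0]), None))
    return translated
```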
    <Paragraph position="17"> In some cases important syntactic information is conveyed by the word qualifiers associated with particular grammar codes, and the translation system is therefore sensitive to these correlations. For example, the Subject Raising rule above makes reference to the left context qualifier "it". Another example where word qualifiers can be utilised straightforwardly is with ditransitive verbs such as give and donate. Give is coded as \[D1(to)\], which allows us to recover the information that this verb permits dative movement and requires a prepositional phrase headed by "to": (Takes NP NP ToPP) and (Takes NP NP NP).</Paragraph>
    <Paragraph position="18"> On the other hand, donate is coded \[T1(to)\], which tells us that it does not undergo dative movement but does require a prepositional phrase headed by "to": (Takes NP NP ToPP).</Paragraph>
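The give/donate contrast can be captured by a small lookup from a code plus its word qualifier to a set of subcategorisation frames. The frame notation follows the text; the function itself and its coverage of only these two code/qualifier pairs are assumptions for illustration.

```python
def frames(code, qualifier=None):
    """Derive subcategorisation frames from a grammar code and qualifier."""
    if code == "D1" and qualifier == "to":
        # Dative movement available: "give a book to Mary" / "give Mary a book"
        return ["(Takes NP NP ToPP)", "(Takes NP NP NP)"]
    if code == "T1" and qualifier == "to":
        # No dative movement: "donate a book to the library" only
        return ["(Takes NP NP ToPP)"]
    return []
```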
    <Paragraph position="19"> There are many more distinctions which are conveyed by the conjunction of grammar codes and word qualifiers (see Michiels, 1982, for further details). However, exploiting this information to the full would be a non-trivial task, because it would require accessing the relevant knowledge about the words contained in the qualifier fields from their LDOCE entries.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 LEXICAL ENTRIES FOR PATR-II
</SectionTitle>
    <Paragraph position="0"> The output of the transformation program can be used to derive entries which are appropriate for particular grammatical formalisms. To demonstrate that this is possible we have implemented a system which constructs dictionary entries for the PATR-II system (Shieber, 1984, and references therein). PATR-II was chosen because it has been reimplemented in Cambridge and was therefore available; however, the task would be nearly identical if we were constructing entries for a system based on GPSG, FUG or LFG. We intend to use the LDOCE source in the same way to derive most of the lexicon for the general purpose morphological and syntactic parser we are developing. The latter employs a grammatical formalism based on GPSG; the comparatively theory-neutral lexical entries that we construct from LDOCE should translate straightforwardly into this framework as well.</Paragraph>
    <Paragraph position="1"> The PATR-II parsing system operates by unifying directed graphs (DGs); the completed parse for a sentence will be the result of successively unifying the DGs associated with the words and constituents of the sentence according to the rules of the grammar. The DG for a lexical item is constructed from its lexical entry, which contains a set of templates for each syntactically distinct variant. Templates are themselves abbreviations for unifications which define the DG. For example, the basic entry and associated DG for the verb storm are illustrated in Figure 11.</Paragraph>
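The idea that templates abbreviate unifications can be sketched as follows: each template expands to a list of (feature path, value) equations, and building a lexical item's DG simply unifies the equations of all its templates into one nested structure. The template expansions shown are hypothetical stand-ins loosely modelled on the storm example, not PATR-II's actual definitions.

```python
# Hypothetical template expansions: each maps to (feature path, value)
# equations. Real PATR-II templates would be richer than this.
TEMPLATES = {
    "Dyadic":  [(("sem", "args"), 2)],          # logical: two-place predicate
    "TakesNP": [(("syn", "subcat"), ("NP",))],  # syntactic: one NP object
}

def build_dg(template_names):
    """Unify the equations of the named templates into one DG (nested dict)."""
    dg = {}
    for name in template_names:
        for path, value in TEMPLATES[name]:
            node = dg
            for feature in path[:-1]:
                node = node.setdefault(feature, {})
            existing = node.get(path[-1])
            if existing is not None and existing != value:
                # Two templates demand incompatible atomic values here.
                raise ValueError("unification failure at %r" % (path,))
            node[path[-1]] = value
    return dg
```

Because Dyadic constrains only the semantic side and TakesNP only the syntactic side, the sketch also mirrors the point made below: logical arity and syntactic subcategorisation are kept distinct and are only combined by unification.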
    <Paragraph position="2"> The template Dyadic defines the way in which the syntactic arguments to the verb contribute to the logical structure of the sentence, while the template TakesNP defines what syntactic arguments storm requires; thus, the information that storm is transitive and that it is logically a two-place predicate is kept distinct. Consequently, the system can represent the fact that some verbs which take two syntactic arguments are nevertheless one-place predicates.</Paragraph>
    <Paragraph position="3"> The modified version of PATR-II that we have implemented contains only a small dictionary and constructs entries automatically from restructured LDOCE entries for most verbs that it encounters. As well as carrying over the grammar codes, the PATR-II lexicon system has been modified to include word sense numbers, which are derived from LDOCE. Thus, the analysis of a sentence by the PATR-II system now represents its syntactic and logical structure and the particular senses of the words (as defined in LDOCE) which are relevant in the grammatical context. Figures 12 and 13 illustrate the dictionary entries for marry and persuade constructed by the system from LDOCE.</Paragraph>
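Carrying the sense number through to the constructed entry can be sketched as below, so that each syntactically distinct sense yields its own entry and a parse can report which senses survive. The field names and input shape are our own assumptions about the restructured LDOCE data, not the authors' format.

```python
def make_entries(word, ldoce_senses):
    """Build one PATR-II-style entry per LDOCE sense, keeping the sense number."""
    entries = []
    for number, sense in enumerate(ldoce_senses, start=1):
        entries.append({
            "word": word,
            "sense": number,                  # LDOCE sense number, carried over
            "templates": sense["templates"],  # e.g. ["Dyadic", "TakesNP"]
        })
    return entries
```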
    <Paragraph position="4"> In Figure 14 we show one of the two analyses produced by PATR-II for a sentence containing these two verbs. The other analysis is syntactically and logically identical but incorporates sense two of marry. Thus, the output from this version of PATR-II represents the information that further semantic analysis need only consider sense two of persuade and senses one and two of marry; this rules out one further sense of each, as defined in LDOCE.</Paragraph>
  </Section>
</Paper>