<?xml version="1.0" standalone="yes"?> <Paper uid="E85-1025"> <Title>TOWARDS A DICTIONARY SUPPORT ENVIRONMENT FOR REALTIME PARSING</Title> <Section position="3" start_page="171" end_page="171" type="metho"> <SectionTitle> THE ACCESS ENVIRONMENT </SectionTitle> <Paragraph position="0"> To link the machine-readable version of LDOCE to existing natural language processing systems we need to provide fast access from Lisp to data held in secondary storage. Furthermore, the complexity of the data structures stored on disc should not be constrained in any way by the method of access, because we have little idea what form the restructured dictionary may eventually take.</Paragraph> <Paragraph position="1"> Our first task in providing an environment was therefore the creation of a 'lispified' version of the machine-readable LDOCE file. A batch program written in a general editing facility was used to convert the entire LDOCE typesetting tape into a sequence of Lisp s-expressions without any loss of generality or information. Figure 1 illustrates part of an entry as it appears in the published dictionary, on the typesetting tape and after lispification.</Paragraph> <Paragraph position="2"> This still leaves the problem of access, from Lisp, to the dictionary entry s-expressions held on secondary storage. Ad hoc solutions, such as sequential scanning of files on disc or extracting subsets of such files which will fit in main memory, are not adequate as an efficient interface to a parser. (Exactly the same problem would occur if our natural language systems were implemented in Prolog, since the Prolog 'database facility' refers to the knowledge base that Prolog maintains in main memory.) 
In principle, given that the dictionary is now in a Lisp-readable format, a powerful virtual memory system might be able to manage access to the internal Lisp structures resulting from reading the entire dictionary; we have, however, adopted an alternative solution as outlined below.</Paragraph> <Paragraph position="3"> We have implemented an efficient dictionary access system which services requests for s-expression entries made by client Cambridge Lisp programs. The lispified file was sorted and converted into a random access file together with indexing information from which the disc addresses of dictionary entries for words and compounds can be recovered. Standard database indexing techniques were used for this purpose. The current access system is implemented in the programming language C. It runs under UNIX and makes use of the random file access and inter-process communication facilities provided by this operating system. (UNIX is a Trade Mark of Bell Laboratories.) To the Lisp programmer, the creation of a dictionary process and subsequent requests for information from the dictionary appear simply as Lisp function calls.</Paragraph> <Paragraph position="4"> We have provided for access to the dictionary via head words and the first words of compounds and phrasal verbs, either through the spelling or pronunciation fields. Random selection of dictionary entries is also provided to allow the testing of software on an unbiased sample. 
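
The indexed random access described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C access system itself: the one-entry-per-line layout, the in-memory index, and the names build_index, fetch_entries and random_entry are all hypothetical, and the inter-process link to Lisp is left out.

```python
import io
import random

def build_index(stream):
    """Map each head word to the byte offsets of its entries.

    Assumed layout (hypothetical): one s-expression per line, head word first.
    """
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        fields = line.strip().lstrip(b"(").split()
        if fields:
            index.setdefault(fields[0].decode("ascii"), []).append(offset)
    return index

def fetch_entries(stream, index, head):
    """Seek straight to each stored offset instead of scanning the file."""
    entries = []
    for offset in index.get(head, []):
        stream.seek(offset)
        entries.append(stream.readline().decode("ascii").strip())
    return entries

def random_entry(stream, index, rng=random):
    """Pick one entry uniformly over all offsets, e.g. for test sampling."""
    offsets = [o for group in index.values() for o in group]
    stream.seek(rng.choice(offsets))
    return stream.readline().decode("ascii").strip()

# Toy stand-in for the random access file held on disc.
DEMO_FILE = io.BytesIO(
    b"(pair (n) (sense 1))\n(rivet (n) (sense 1))\n(rivet (v) (sense 1))\n"
)
DEMO_INDEX = build_index(DEMO_FILE)
```

Calling fetch_entries(DEMO_FILE, DEMO_INDEX, "rivet") returns both homograph entries without reading the "pair" entry; a second index keyed on pronunciation could be built the same way.
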
This access is sufficient to support our current parsing requirements but could be supplemented with the addition of further indexing files if required.</Paragraph> <Paragraph position="5"> Eventually access to dictionary entries will need to be considerably more intelligent and flexible than a simple left-to-right sequential pass through the lexical items to be parsed, if our processing systems are to make full use of the information concerning compounds and idioms stored in LDOCE.</Paragraph> </Section> <Section position="4" start_page="171" end_page="172" type="metho"> <SectionTitle> RESTRUCTURING THE DICTIONARY </SectionTitle> <Paragraph position="0"> The lispified LDOCE file retains the broad structure of the typesetting tape and divides each entry into a number of fields: head word, pronunciation, grammar codes, definitions, examples and so forth. However, each of these fields requires further decoding and restructuring to provide client programs with easy access to the information they require (Calzolari (1984) discusses this need). For this purpose the formatting codes on the typesetting tape are crucial since they provide clues to the correct structure of this information. For example, word senses are largely defined in terms of the 2000 word core vocabulary; however, in some cases other words (themselves defined elsewhere in terms of this vocabulary) are used. These words always appear in small capitals and can therefore be recognised because they will be preceded by a font change control character. In Figure 1 above the definition of &quot;rivet&quot; includes the noun definition of &quot;RIVET1&quot;, as signalled by the font change and the numerical superscript which indicates that it is the noun entry homograph; additional notation exists for word senses within homographs. On the typesetting tape, font control characters are indicated within curly brackets by hexadecimal numbers. 
In addition, there is a further complication because this sense is used in the plural and the plural morpheme must be removed before &quot;RIVET&quot; can be associated with a dictionary entry. However, the restructuring program can achieve this because such morphology is always italicised, so the program knows that in the context of non-core vocabulary items the italic font control character signals the occurrence of a morphological variant of a LDOCE head entry.</Paragraph> <Paragraph position="1"> A suite of programs to unscramble and restructure all the fields in LDOCE entries has been written which is capable of decoding all the fields except those providing cross-reference and usage information for complete homographs. Figure 2 illustrates a simple lexical entry before and after the application of these programs.</Paragraph> <Paragraph position="2"> The development of the restructuring programs is a non-trivial task because the organisation of information on the typesetting tape presupposes its visual presentation, and the ability of human users to apply common sense, utilise basic morphological knowledge, ignore minor notational inconsistencies, and so forth. To provide a test-bed for these programs we have implemented an interactive dictionary browser capable of displaying the restructured information in a variety of ways and representing it in perspicuous and expanded form.</Paragraph> <Paragraph position="3"> To illustrate the problems involved in the restructuring process we will discuss the restructuring of the grammar codes in some detail; however, the reader should bear in mind that this represents only one comparatively constrained field of an LDOCE entry and therefore a small proportion of the overall restructuring task. 
Figure 3 illustrates the grammar code field for the third word sense of the verb &quot;believe&quot; as it appears in the published dictionary, on the typesetting tape and after restructuring.</Paragraph> <Paragraph position="4"> Multiple grammar codes are elided and abbreviated in the dictionary to save space, and restructuring must reconstruct the full set of codes. This can be done with knowledge of the syntax of the grammar code system and the significance of punctuation and font changes. For example, semicolons indicate concatenated codes and commas indicate concatenated, elided codes. However, discovering the syntax of the system is difficult since no explicit description is available from Longman and the code is geared more towards visual presentation than formal precision; for example, words which qualify codes, such as &quot;to be&quot; in Figure 3, appear in italics and will therefore be preceded by the font control character &quot;*45&quot;.</Paragraph> <Paragraph position="6"> [Figure fragment, partially recovered:] (8 *45 a *44 2 things that are alike or of the same kind !, and are usu ! used together : *46 a pair of shoes tJ a beautiful pair of legs *44 *63 compare *CA COUPLE *CB *8B *45 b *44 2 playing cards of the same value but of different *CA SUIT *CB *46 s *8A (8 *45 a *44 2 people closely connected : *46 a pair of dancers *45 b *CA COUPLE *CB *88 *44 (2) (esp t. in the phr !. *45 the happy pair *44) *45 c *46 sl *44 2 people closely connected who cause annoyance or displeasure : *46 You !'re a fine pair coming as late as this !!)</Paragraph> <Paragraph position="8"> But sometimes the thin space control character &quot;*64&quot; also appears; the insertion of this code is based solely on visual criteria, rather than the informational structure of the dictionary. 
Similarly, choice of font can be varied for reasons of appearance, and occasionally information normally associated with one field of an entry is shifted into another to create a more compact or elegant printed entry. In addition to the 'noise' generated by the fact that we are working with a typesetting tape geared to visual presentation, rather than a database, there are errors in the use of the grammar code system; for example, Figure 4 illustrates the code for the first sense of the noun &quot;promise&quot;.</Paragraph> <Paragraph position="10"> The occurrence of the full code &quot;C3&quot; between commas is incorrect because commas are clearly intended to delimit sequences of elided codes. This type of error arises because grammatical codes are constructed by hand and no automatic checking procedure is attempted (see Michiels, 1982). Finally, there are errors or omissions in the use of the codes; for example, Figure 5 illustrates the grammar codes for the listed senses of the verb &quot;upset&quot;.</Paragraph> <Paragraph position="11"> [Figure 5 fragment: upset: for cat = v] These codes correspond to the simple transitive and intransitive uses of &quot;upset&quot;; no codes are given for the uses of &quot;upset&quot; with sentential complements. Clearly, the restructuring programs cannot correct this last type of error; however, we have developed a system which is sufficiently robust to handle the other problems described above. 
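
The kind of robust expansion of elided grammar codes discussed above might look as follows. This is a minimal Python sketch, not the restructuring programs themselves: it assumes that a code is a capital letter, an optional number and an optional small letter, that items after a comma inherit whatever leading parts they omit, and that nothing is inherited across a semicolon. The name expand_codes is hypothetical, and qualifying words such as "to be" are simply skipped.

```python
import re

# Assumed shape of one item: optional capital, optional number, optional small letter.
CODE_ITEM = re.compile(r"^([A-Z])?([0-9]+)?([a-z])?$")

def expand_codes(field):
    """Expand an elided grammar code field into a list of full codes."""
    full_codes = []
    for group in field.split(";"):
        letter = None
        number = None  # nothing is inherited across a semicolon
        for item in group.split(","):
            item = item.strip()
            match = CODE_ITEM.match(item)
            if not item or match is None:
                continue  # material this sketch does not understand
            new_letter, new_number, suffix = match.groups()
            if new_letter:
                # A full code between commas (such as "C3") is tolerated:
                # it replaces the inherited letter and number.
                letter, number = new_letter, new_number
            elif new_number:
                number = new_number
            if letter is None:
                continue  # an elided item with nothing to inherit from
            full_codes.append(letter + (number or "") + (suffix or ""))
    return full_codes
```

Under these assumptions expand_codes("T5a,b;V3") yields T5a, T5b and V3, and a field such as "T1,C3,5" is read as T1, C3 and C5 rather than being rejected.
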
Rather than apply these programs to the dictionary and create a new restructured file, they are applied on a demand basis, as required by the dictionary browser or the other client programs described in the next section; this allows us to continue to refine the restructuring programs incrementally as further problems emerge.</Paragraph> </Section> <Section position="5" start_page="172" end_page="176" type="metho"> <SectionTitle> USING THE DICTIONARY </SectionTitle> <Paragraph position="0"> Once the information in LDOCE has been restructured into a format suitable for access by client programs, it still remains to be shown that this information is of use to our natural language processing systems. In this section, we describe the use that we have made of the grammar codes and word sense definitions.</Paragraph> <Paragraph position="1"> Grammar codes The grammar code system used in LDOCE is based quite closely on the descriptive grammatical framework of Quirk et al. (1972). The codes are doubly articulated; capital letters represent the grammatical relations which hold between a verb and its arguments and numbers represent subcategorisation frames which a verb can appear in. (The small letters which appear with some codes represent a variety of less important information, for example, whether a sentential complement will take an obligatory or optional complementiser.) Most of the subcategorisation frames are specified by syntactic category, but some are very ill-specified; for instance, 9 is defined as &quot;needs a descriptive word or phrase&quot;. In practice anything functioning as an adverbial will satisfy this code, when attached to a verb. 
The criteria for assignment of capital letters to verbs are not made explicit, but are influenced by the syntactic and semantic relations which hold between the verb and its arguments; for example, I5, L5 and T5 can all be assigned to verbs which take a NP subject and a sentential complement, but I5 will only be assigned if there is a fairly close semantic link between the two arguments and T5 will be used in preference to I5 if the verb is felt to be semantically two place rather than one place, such as &quot;know&quot; versus &quot;appear&quot;. On the other hand, both &quot;believe&quot; and &quot;promise&quot; are assigned V3 which means they take a NP object and infinitival complement, yet there is a similar semantic distinction to be made between the two verbs; so the criteria for the assignment of the V code seem to be syntactic.</Paragraph> <Paragraph position="2"> The parsing systems we are interested in all employ grammars which carefully distinguish syntactic and semantic information of this kind; therefore, if the information provided by the Longman grammar code system is to be of use, we need to be able to separate out this information and map it into the representation scheme for lexical entries used by one of these parsing systems. To demonstrate that this is possible we have implemented a system which constructs dictionary entries for the PATR-II system (Shieber, 1984 and references therein). PATR-II was chosen because the system has been reimplemented in Cambridge and was therefore available; however, the task would be nearly identical if we were constructing entries for a system based on GPSG, FUG or LFG.</Paragraph> <Paragraph position="3"> The PATR-II parsing system operates by unifying directed graphs (DGs); the completed parse for a sentence will be the result of successively unifying the DGs associated with the words and constituents of the sentence according to the rules of the grammar. 
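
The unification step can be illustrated with a small Python sketch. It is a deliberate simplification and not PATR-II itself: feature structures are modelled as nested dictionaries, an empty dictionary stands for an unconstrained value, and shared (reentrant) substructures are ignored. The names unify and UnificationFailure are hypothetical.

```python
class UnificationFailure(Exception):
    """Raised when two structures carry conflicting atomic values."""

def unify(left, right):
    """Return the unification of two feature structures (nested dicts or atoms).

    Neither argument is modified; a new structure is returned.
    """
    if left == {}:
        return right  # an empty structure unifies with anything
    if right == {}:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        result = dict(left)
        for feature, value in right.items():
            if feature in result:
                result[feature] = unify(result[feature], value)
            else:
                result[feature] = value
        return result
    if left == right:
        return left
    raise UnificationFailure(f"{left!r} conflicts with {right!r}")

# Two partial descriptions of a verb, combined into one structure.
VERB = {"cat": "v", "head": {"aux": "false"}}
SENSE = {"head": {"trans": {"pred": "storm", "sense-no": 1}}}
STORM = unify(VERB, SENSE)
```

Here STORM carries the category and auxiliary information from the first structure together with the predicate and sense number from the second, while unifying structures whose cat values differ raises UnificationFailure.
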
The DG for a lexical item is constructed from its lexical entry which will consist of a set of templates for each syntactically distinct variant.</Paragraph> <Paragraph position="4"> Templates are themselves abbreviations for unifications which define the DG. For example, the basic entry and associated DG for the verb &quot;storm&quot; are illustrated in Figure 6.</Paragraph> <Paragraph position="5"> [Figure 6:] word storm: word sense ~ &lt;head trans sense-no&gt; = 1 V Takes NP Dyadic worddag storm: [cat: v head: [aux: false trans: [pred: storm sense-no: 1 arg1: &lt;DG15&gt; = [] arg2: &lt;DG16&gt; = []]] syncat: [first: [cat: NP head: [trans: &lt;DG15&gt;]] rest: [first: [cat: NP head: [trans: &lt;DG16&gt;]] rest: [first: lambda]]]] The template Dyadic defines the way in which the syntactic arguments to the verb contribute to the logical structure of the sentence; thus, the information that &quot;storm&quot; is transitive and that it is logically a two-place predicate is kept distinct. Consequently, the system can represent the fact that some verbs which take two syntactic arguments are nevertheless logically one-place predicates.</Paragraph> <Paragraph position="6"> It is not possible to automatically construct PATR-II dictionary entries for verbs just by mapping one full grammar code from the restructured LDOCE entry into a set of templates. However, it turns out that if we compare the full set of grammar codes associated with a particular sense of a verb, following a suggestion of Michiels (1982), then we can construct the correct set of templates. That is, we can extract all the information that PATR-II requires concerning the subcategorisation and semantic type of verbs. For example, as we saw above, &quot;believe&quot; under one sense is assigned the codes T5 and V3; the presence of the T5 code tells us that &quot;believe&quot; is a 'raising-to-object' verb and logically two-place under the V3 interpretation. 
On the other hand, &quot;persuade&quot; is only assigned the V3 code, so we can conclude that it is three-place with object control of the infinitive. By systematically exploiting the collocation of different codes in the same field, it is possible to distinguish the raising, equi and control properties of verbs. In effect, we are utilising what were seen as the transformational consequences of the semantic type of the verb within classical generative grammar.</Paragraph> <Paragraph position="7"> The modified version of PATR-II that we have implemented contains a small dictionary and constructs entries automatically from restructured LDOCE entries for most verbs that it encounters. As well as carrying over the grammar codes, PATR-II has been modified to represent the word sense numbers which particular grammar codes are associated with. Thus, the analysis of a sentence by the PATR-II system now represents its syntactic and logical structure and the particular senses of the words (as defined in LDOCE) which are relevant in the grammatical context. Figure 7 illustrates the dictionary entries for &quot;marry&quot; and &quot;persuade&quot; constructed by the system from LDOCE.</Paragraph> <Paragraph position="8"> In Figure 8 we show one of the two analyses produced by PATR-II for a sentence containing these two verbs. [Figure 8:] parse: uther might persuade gwen to marry cornwall analysis 1: [cat: SENTENCE head: [form: finite agr: [per: p3 num: sg] aux: true trans: [pred: possible sense-no: 1 arg1: [pred: persuade sense-no: 2 arg1: [ref: uther sense-no: 1] arg2: [ref: gwen sense-no: 1] arg3: [pred: marry sense-no: 2 arg1: [ref: gwen sense-no: 1] arg2: [ref: cornwall sense-no: 1]]]]]] The other analysis is syntactically and logically identical but incorporates sense two of &quot;marry&quot;. 
Thus, the system knows that further semantic analysis need only consider sense two of &quot;persuade&quot; and sense one and two of &quot;marry&quot;; this rules out one further sense of each, as defined in LDOCE.</Paragraph> <Paragraph position="9"> Word sense definitions The automatic analysis of the definition texts of LDOCE entries is aimed at making the semantic information on word senses encoded in these definitions available to natural language processing systems. LDOCE is particularly suitable to such an endeavour because of the 2000 word restricted definition vocabulary, and in fact only 'central' senses of the words in this restricted vocabulary occur in definition texts. It is thus possible to process the LDOCE definition of a word sense in order to produce some representation of the sense definition in terms of senses of words in the restricted vocabulary. This representation could then be combined, for the benefit of the client language processing system, with the other semantic information encoded for word senses in LDOCE; in particular the 'box codes' that give simple selectional restrictions and the 'subject codes' that classify senses according to subject area usage. (These are not in the published version of the dictionary, but are available on the tape.) There are various possibilities for the form of the output resulting from processing a definition. The current experimental system produces output that is convenient for incorporating new word senses into a knowledge base organized around classification hierarchies, as discussed shortly. However, the system allows the form of output structures to be specified in a flexible way. Alternative possible output representations would be meaning postulates and definitions based on semantic primitives.</Paragraph> <Paragraph position="10"> As mentioned above, the implemented experimental system is intended to enable the classification (see e.g. 
Schmolze, 1983) of new word senses with respect to a hierarchically organized knowledge base, for example the one described in Alshawi (1983). The proposal being made here is that the analysis of dictionary definitions can provide enough information to link a new word sense to domain knowledge already encoded in the knowledge base of a limited domain natural language application such as a database query system. Given a hand-coded hierarchical organization of the relevant (central) senses of the definition vocabulary together with a classification of the relationships between these senses and domain specific concepts, the LDOCE definition of a new word sense often contains enough information to enable the inclusion of the word sense in this classification, and hence allow the new word to be handled correctly when performing the application task.</Paragraph> <Paragraph position="11"> The information necessary for this process is present, in the case of nouns, as restrictions on the classes which subsume the new type of object, its properties, and predications often expressed by relative clauses. There are also a number of more specific predications (such as &quot;purpose&quot; in the example given below) that are very common in dictionary definitions, and have immediate utility for the classification of the relationships between word senses. Similarly, the information relevant to the classification of verb and adjective senses present in sense definitions includes the classes of predicates that subsume the new predicate corresponding to the word sense, restrictions on the arguments of this predicate, and words indicating opposites as is frequently the case with adjective definitions.</Paragraph> <Paragraph position="12"> Figure 9 below shows the output produced by the implemented definition analyser for lispified LDOCE definitions of one of the noun senses and one of the verb senses of the word &quot;launch&quot;. 
It should be emphasized that the output produced is not regarded as a formal language, but rather as an intermediate data structure containing information relevant to the classification process.</Paragraph> <Paragraph position="13"> (launch) (a large usu. motor-driven boat used for carrying people on rivers, lakes, harbours, etc.)</Paragraph> </Section> <Section position="6" start_page="176" end_page="176" type="metho"> <SectionTitle> ((CLASS BOAT) (PROPERTIES (LARGE)) (PURPOSE (PREDICATION (CLASS CARRY) (OBJECT PEOPLE)))) </SectionTitle> <Paragraph position="0"> (to send (a modern weapon or instrument) into the sky or space by means of scientific explosive apparatus) The analysis process is intended to extract the most important information from definitions without necessarily having to produce a complete analysis of the whole of a particular definition text, since attempting to produce complete analyses would be difficult for many LDOCE definition texts. In fact the current definition analyser applies successively more specific phrasal analysis patterns, with more detailed analyses being possible when relatively specific phrasal patterns are applied successfully to a definition. A description of the details of this analysis mechanism is beyond the scope of the present paper.</Paragraph> <Paragraph position="1"> Currently, around fifty phrasal patterns are used altogether for noun, verb, and adjective definitions. A major difficulty encountered so far in this work stems from the liberal use in LDOCE definitions of derivational morphology and phrasal verbs, which greatly expands the effective definition vocabulary.</Paragraph> </Section> </Paper>