<?xml version="1.0" standalone="yes"?>
<Paper uid="C80-1057">
  <Title>EXPLOITING A LARGE DATA BASE BY LONGMAN</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SECTION I. DESCRIPTION OF THE COMPUTER FILES
</SectionTitle>
    <Paragraph position="0"> A contract with LONGMAN Ltd has made it possible for us to have access to the computer files of two dictionaries, LDOCE (LONGMAN</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="374" type="metho">
    <SectionTitle>
DICTIONARY OF CONTEMPORARY ENGLISH) and LDOEI
(LONGMAN DICTIONARY OF ENGLISH IDIOMS). We
</SectionTitle>
    <Paragraph position="0"> have had the LDOCE file for some time but have only just received the LDOEI one.</Paragraph>
    <Paragraph position="1"> The features of LDOCE which make it especially useful for language processing are the following : a) it reflects the surface structure environment of its entries by means of a sophisticated system of grammatical codes, most of which can be thought of as strict subcategorization features. For instance, LDOCE specifies 1.- that nouns like FACT or CLAIM can be followed by a THAT-clause, 2.- that a verb such as WATCH can occur followed by an NP followed by an ING-form (we watched the soldiers bleeding).</Paragraph>
    <Paragraph position="2"> Though it is mainly concerned with SURFACE structure, LDOCE nevertheless distinguishes between an NP pair following GIVE (He gave his brother a new bicycle : NP1 NP2, [D1] code) and one following CONSIDER (He considered his brother a fool : NP1 NP2, [X1] code). b) through a system of semantic codes of the Katz-and-Fodor type (these codes do not appear in the printed version of the dictionary), LDOCE places semantic restrictions on the subjects and objects of verbs (or on the type of noun that an adjective can modify), specifying for instance that PERSUADE requires a [+ HUMAN] object, and EXTEMPORIZE a [+ HUMAN] subject.</Paragraph>
    <Paragraph position="3"> c) LDOCE makes use of a defining vocabulary of some 2,000 items : all the definitions and all the examples associated with the 60,000 entries are couched in that restricted vocabulary. Concerning points a and b it should be emphasized that the grammatical and semantic codes can appear at two different levels : 1.- ENTRY level : the code is appropriate to all the definitions of the entry in question; 2.- DEFINITION level : the code is not appropriate to the whole entry (i. e. in all its senses) but only to those readings that correspond to the definitions that the code is tagged to.</Paragraph>
    <Paragraph position="4"> For instance, READ cannot be assigned the same grammatical and semantic codes in sentences 1 and 2 :  1.- He manages to read at least one book every day. 2.- Your paper doesn't read too well.</Paragraph>
    <Paragraph position="5">  This second level makes it possible to avoid a proliferation of indiscriminate disjunctions in the specification of the codes to be associated with a given lexeme. It seems to us that by restricting the occurrence of code specifications to only one level (namely, the ENTRY level), one reduces the predictive power of both grammatical and semantic codes to practically nil in the case of complex entries. On the other hand, the codes that are appropriate at DEFINITION level provide an interesting type of correlation between strict subcategorization and selection rules on the one hand and choice of appropriate reading on the other : such a type of correlation is bound to prove very useful for machine translation purposes. Owing to the use of the same defining vocabulary, LDOEI is a natural extension of LDOCE. Whereas the latter merely lists the idiomatic phrases under the relevant headwords, LDOEI gives the information necessary for recognizing and generating all the syntactic and morphological variants of each idiom. To give only one example, in the entry &quot;TELL° I WHERE TO GET OFF [V : Pass 2]&quot; the sign ° indicates that TELL admits of morphological variation in this phrase, I specifies the place of the indirect object (which does not belong to the idiomatic phrase as such) and the grammatical note [V : Pass 2] informs the user that the syntactic value of the whole phrase is verbal (i. e. that it functions as a VP) and that the passive is to be formed by selecting the indirect object as subject (&quot;He was told where to get off&quot;).</Paragraph>
  </Section>
  <Section position="4" start_page="374" end_page="374" type="metho">
    <SectionTitle>
(LONGMAN LEXICON)
</SectionTitle>
    <Paragraph position="0"> This forthcoming thesaurus is also designed to tie in with LDOCE, of which it is partly a by-product. As Section III will make clear, our analysis of LDOCE definitions will have to rely on a thesaurus, but we do not know yet whether LOLEX will be available in machine-readable form.</Paragraph>
  </Section>
  <Section position="5" start_page="374" end_page="374" type="metho">
    <SectionTitle>
SECTION II. TOWARDS A SEMANTICALLY ENRICHED
SURFACE PARSER BASED ON LDOCE
</SectionTitle>
    <Paragraph position="0"> I.- ~_!~!~!!z_~_~.</Paragraph>
    <Paragraph position="1"> It stands to reason that automatic parsing programmes have to have access to at least two linguistic components : a grammar and a lexicon. In most systems that we know something about, the grammar is a good deal more sophisticated than the lexicon. The latter includes only a small sub-part of the total lexicon for the language under study, while the grammar takes care of a large proportion of the basic grammatical structures.</Paragraph>
    <Paragraph position="2"> We would like to explore a diametrically opposed approach : our starting-point is a sophisticated lexicon for core English and our aim is to make maximum use of the information it contains to keep our grammar within strict bounds.</Paragraph>
    <Paragraph position="3"> An obvious first step in developing a parser based on LDOCE is to write algorithms that translate the various grammatical codes into scanning procedures. Most of these algorithms are fairly straightforward and have already been written. What we would like to focus on here is the simplification of the categorial component that such a lexically based syntax permits. Consider 3 : 3.- The claim that he has succeeded is patently false. Since there is a code (namely, 5) that stipulates whether an element (in this case, a countable noun coded C; the whole code is therefore [C5]) can be followed by a THAT-clause, we will not attempt to account for THAT-clauses via rewrite rules for the category NP, i. e. we won't have such a rule as :</Paragraph>
  </Section>
  <Section position="6" start_page="374" end_page="374" type="metho">
    <SectionTitle>
NP → NP THAT S
</SectionTitle>
    <Paragraph position="0"> Naturally enough, there is no LDOCE code stipulating that a noun can be followed by a relative clause (such a code would be meaningless since virtually all nouns can have a relative clause, if not a restrictive then at least an appositive one, tagged on their right). We will therefore have to include relative clauses somewhere in our rewrite rules for the category NP.</Paragraph>
    <Paragraph position="1"> Here too, however, the lexical approach to syntax can prove useful. To show this, let us first define a CONCATENATION as a string every member of which is tied to some other by means of an LDOCE grammatical code (it requires the other member for the satisfaction of its code or it serves to satisfy the other member's code). The concept of CONCATENATION can be equated with that of CLAUSE if it is extended to cover : 1.- free elements, i. e. elements which are not bound to one particular word or phrase inside the clause (both sentential adjuncts and linking words such as conjunctions would fall into this category).</Paragraph>
    <Paragraph position="2"> 2.- a subject role, i. e. the creation of a link between a tensed V (the starting-point for the concatenation - see below) and an NP to be found on its right or on its left.</Paragraph>
    <Paragraph position="3"> We have already looked into the mechanisms of tensed V searches and subject role assignments and we have found that various properties of English make the task of algorithmizing these mechanisms less formidable than it appears at first sight. The most prominent among these properties are the following : 1.- the conditions of use of the auxiliary DO; 2.- the fact that only tensed Vs require a subject; 3.- the fact that only the first (i. e. leftmost) member of a verbal group can bear tense; 4.- the fact that it must bear tense; 5.- the morphological contrast between verb and noun with respect to number (-S marks singular verbs but plural nouns).</Paragraph>
    <Paragraph position="4"> Turning now to relative clauses, we see that we can characterize them with great ease : a relative clause is a concatenation that opens with a relative phrase (one of whose realizations is Ø and another the multi-purpose word THAT, so that a recognition procedure based on the occurrence of particular morphemes is bound to fail in some cases) and that misses an NP (it is this second property that has to be regarded as essential).</Paragraph>
    <Paragraph position="5"> The readers who are familiar with Hudson 1976 will have realized that the approach advocated here is nearer to Hudson's version of systemic grammar than to transformational grammar : we make full use of sister-dependencies, starting with the tensed V, which we believe to provide the best entry-point into the network of relationships woven by the various code-bearing elements in a sentence.</Paragraph>
    <Paragraph position="6"> II.- Deep structure configurations.</Paragraph>
    <Paragraph position="7"> It is obvious that our parser will have to be able to : 1.- recognize the situations in which the basic order of the constituents (i. e. the one stipulated in the scanning procedures associated with the grammatical codes) is disrupted under the effect of transformations such as PASSIVIZATION, TOPICALIZATION, RELATIVE CLAUSE FORMATION, GAPPING, ...; 2.- keep track of the constituents that have been moved.</Paragraph>
    <Paragraph position="8"> We do not intend to deal with these points here but we would like to stress that the problems for RECOGNITION are very different from those for GENERATION. RAISING and EQUI, for instance, are rather formidable and problem-ridden rules from the point of view of generation but we shall argue that we do not need their counterparts for recognition purposes. We shall illustrate this point by looking at verb complementation - at the same time we will show that the syntactic potential of a verb can be used as a guide to its deep structure configuration.</Paragraph>
    <Paragraph position="9"> In a VP the SYNTACTIC head is always the first, i. e. tensed verb. As we have seen, the way the parser builds up concatenations reflects this property. As for the SEMANTIC head, it is very often another verb than the first one. This, however, does not matter in so far as the auxiliaries and semi-auxiliaries (HAPPEN, SEEM, ...) do not have any semantic code associated with them and can therefore be regarded as semantically transparent : they have no effect whatsoever on the pairs that the semantic component will be called on to examine for compatibility. Consider such a sentence as 4 :</Paragraph>
    <Paragraph position="11"> 4.- My father seems to have been reading too many strips.</Paragraph>
    <Paragraph position="12"> The starting-point for building the concatenation would be the tensed V, i. e. SEEMS : the concatenation would be allowed to grow both to the left (assignment of subject role to the NP 'my father') and to the right : the appropriate syntactic code for SEEMS is [I3] here (i. e. followed by an infinitive with TO) : [My father] (subject NP) seems [to have] (satisfies [I3] code of SEEMS). SEEM is not coded semantically, so that the semantic component would not be called on at this stage. In the next step, HAVE would be examined and its [I8] code seen to be applicable ([I8] specifies that the code-bearing element be followed by an EN-form) so that a new sister dependency would be established : My father seems to have [been] (satisfies [I8] code of HAVE). In similar fashion, BEEN would have its own code (i. e. + ING-form) satisfied by</Paragraph>
  </Section>
  <Section position="7" start_page="374" end_page="374" type="metho">
    <SectionTitle>
READING :
</SectionTitle>
    <Paragraph position="0"> My father seems to have been reading. Neither HAVE nor BE is semantically coded with respect to the definitions that have been chosen on the basis of the grammatical codes that are satisfied in the sentence. READING, on the other hand, will be coded syntactically (it requires one NP as object : code [T1]) and semantically (it requires a [+ HUMAN] subject).</Paragraph>
    <Paragraph position="1"> Since SEEM, HAVE and BEEN are semantically transparent, the semantic component will examine the pair 'my father' and 'reading' and find them to be compatible as a subject-verb configuration. But how does the parser know that 'father' is the subject of 'reading' ? A very simple-minded rule states that there is no change in subject in a verbal complex as long as there is no interrupting NP; if there is one, it is to be regarded as the subject of the following verb(s) : I want to read; I started to read; I happened to be reading (I subject of READ).</Paragraph>
    <Paragraph position="2"> I want you to read; I saw you reading; I made you read (you subject of READ). This rule admits of at least one exception, namely PROMISE : I promised you to read (I subject of READ in spite of interrupting YOU).</Paragraph>
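    <Paragraph> The simple-minded subject rule and its PROMISE exception can be sketched as follows. This is only a minimal illustration in Python, not the parser described in this paper : the input format (a list of verb and NP items following the clause subject) and the names assign_subjects and SUBJECT_KEEPING_VERBS are assumptions introduced for the example.

```python
from typing import List, Tuple

# Verbs after which an interrupting NP does NOT become the new subject.
# The text names PROMISE as the one known exception.
SUBJECT_KEEPING_VERBS = {"PROMISE"}


def assign_subjects(subject: str, chain: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair every verb of a verbal complex with its subject.

    `chain` lists what follows the clause subject, in surface order, as
    ("V", lemma) or ("NP", text) items (a hypothetical input format).
    """
    current = subject
    last_verb = None
    pairs = []
    for kind, value in chain:
        if kind == "V":
            pairs.append((value, current))
            last_verb = value
        elif kind == "NP" and last_verb not in SUBJECT_KEEPING_VERBS:
            # An interrupting NP takes over as subject of the following verb(s).
            current = value
    return pairs


print(assign_subjects("I", [("V", "WANT"), ("NP", "you"), ("V", "READ")]))
print(assign_subjects("I", [("V", "PROMISE"), ("NP", "you"), ("V", "READ")]))
```

    With no interrupting NP, every verb in the complex keeps the original subject, which is the behaviour needed for 'My father seems to have been reading'.</Paragraph>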
    <Paragraph position="3"> Another problem relating to deep structure configurations is that of determining, in V + NP + (TO) + INFINITIVE</Paragraph>
  </Section>
  <Section position="8" start_page="374" end_page="374" type="metho">
    <SectionTitle>
or V + NP + ING-FORM
</SectionTitle>
    <Paragraph position="0"> structures, whether the NP is to be regarded as the object of the V or not (contrast 'I want him to go' with 'I persuaded him to go').</Paragraph>
    <Paragraph position="1"> Instead of going into each deep structure distinction that can be drawn within the field of verb complementation, we will show that the verb classes which Akmajian and Heny 1975 (p. 364 and foll.) find it necessary to set up in their introduction to transformational grammar to account for deep structure distinctions (Figure 1) can be held apart on the basis of their surface structure potential as captured in their LDOCE grammatical codes.</Paragraph>
    <Paragraph position="2">  The raised numbers on the features in the matrix below refer to the following list of test sentences :
      1. I want to go
      2. I want him to go
      3. a) ? I want that he should go
         b) * I want that he goes
      4. * I persuaded to go
      5. I persuaded him to go
      6. * I persuaded that he went
      7. * I believe to have gone
      8. I believe him to have gone
      9. I believe that he has gone
      10. I failed to go
      11. * I failed him to go
    The NP following the verb is its deep object only in the case of Class II verbs (I persuaded him to go → I persuaded him); there is no NP in Class IV (* I failed him to go) and the NP is not the object in Class I or in Class III (I want him to go ↛ I want him; I believe him to have gone ↛ I believe him).</Paragraph>
    <Paragraph position="3"> As for PROMISE (not discussed in Akmajian and Heny 1975), it could be defined by means of the following feature row : + T3, + T5, + V3 : I promised to go (T3); I promised him to go (V3); I promised that I would go (T5). The NP between PROMISE and the TO-INFINITIVE is the object (as in the PERSUADE class) but it is not the subject of the infinitive.</Paragraph>
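    <Paragraph> To make the argument concrete, the surface potential of the five verbs can be written down as feature rows and checked for distinctness. The sketch below is an illustration only : the rows are read off the acceptability marks of the test sentences above (with the doubtful sentence 3a counted as unacceptable), not copied from the LDOCE files; the three columns are named by surface pattern because only the PROMISE row is explicitly tied to the codes T3, V3 and T5 in the text; and the names ROWS, NP_ROLE, np_role and rows_are_distinct are hypothetical.

```python
from typing import Dict, Optional, Tuple

# One row per verb: (V + TO-inf, V + NP + TO-inf, V + THAT-clause).
# None means that the list of test sentences gives no evidence either way.
Row = Tuple[Optional[bool], Optional[bool], Optional[bool]]

ROWS: Dict[str, Row] = {
    "WANT": (True, True, False),       # sentences 1, 2, 3 (3a counted as out)
    "PERSUADE": (False, True, False),  # sentences 4, 5, 6
    "BELIEVE": (False, True, True),    # sentences 7, 8, 9
    "FAIL": (True, False, None),       # sentences 10, 11; no THAT-clause test given
    "PROMISE": (True, True, True),     # + T3, + V3, + T5 in the text
}

# What the text says about the NP in V + NP + TO-inf for each verb.
NP_ROLE: Dict[str, str] = {
    "WANT": "not object of V",
    "PERSUADE": "object of V",
    "BELIEVE": "not object of V",
    "FAIL": "no NP",
    "PROMISE": "object of V, not subject of infinitive",
}


def rows_are_distinct() -> bool:
    """True if no two verbs share the same surface feature row."""
    return len(set(ROWS.values())) == len(ROWS)


def np_role(row: Row) -> str:
    """Recover the deep-structure status of the NP from a surface feature row."""
    for verb, known in ROWS.items():
        if known == row:
            return NP_ROLE[verb]
    raise KeyError(row)


print(rows_are_distinct())
print(np_role((False, True, False)))
```

    Since every row is different, a lookup on surface features alone is enough to select the class, and with it the status of the NP, which is the point made above.</Paragraph>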
  </Section>
  <Section position="9" start_page="374" end_page="374" type="metho">
    <SectionTitle>
SECTION THREE : LDOCE DEFINITIONS : AN
IR APPROACH TO SEMANTIC AND KNOWLEDGE-OF-
THE-WORLD INFORMATION.
</SectionTitle>
    <Paragraph position="0"> LDOCE definitions convey semantic information in a fairly explicit, but non-formatted, form. Even though all definitions are written in a DEFINING VOCABULARY (not to be confused with a BASIC VOCABULARY, see below), no attempt has been made to stick to a limited number of DEFINING FORMULAE. To give an example of what we mean by DEFINING FORMULA, and to anticipate what will be the main concern of this section, we wish to look at the class of INSTRUMENTS. In theory, it could be agreed by the dictionary-makers that all instruments have to include the phrase &quot;instrument used for Ving&quot; in their definitions. In such a defining formula the word INSTRUMENT would be a DEFINING PRIMITIVE and the predicate USED FOR would be a DEFINING RELATION (in this case, between an instrument and a predicate). Such a formatted definition would be less precise and less exact, but infinitely more usable, than a definition of the common type. Smith and Maxwell 1973 (p. 2) point out that in a typical dictionary approximately 50 % of the vocabulary appears in the definitions. LDOCE is a major improvement on such a typical dictionary in that its defining vocabulary is restricted to some 2,000 items (used to define some 60,000 entries). My purpose in this section is to reflect on the possibility of turning a significant number of LDOCE definitions into fully formatted ones (i. e. making use of defining formulae).</Paragraph>
    <Paragraph position="1"> Consider the sentence : I saw the man in the park with a telescope [Woods in Rustin 1973, p. 17~]. The PREFERRED reading is the one that associates 'with a telescope' with the predicate 'saw' rather than with either of the NP heads 'man' or 'park' : 'saw with a telescope' rather than 'man with a telescope' or 'park with a telescope'. If we had available a formatted definition of TELESCOPE (&quot;instrument used for seeing ...&quot;), there would be no problem in a system of preferential semantics : the link between 'saw' and 'telescope' (embodied in the definition of the latter) would lead to the selection of the preferred reading on the basis of the DENSEST MATCH FIRST principle. As a matter of fact, the LDOCE definition for 'telescope' is very nearly what we need : &quot;a tubelike scientific instrument used for seeing distant objects by making them appear nearer and larger&quot;. A simple matching procedure between our suggested defining formula for instruments and the LDOCE definition for 'telescope' would have been sufficient in this case. The problem, of course, is that there is absolutely no guarantee that the defining formula will be part of the definition of all instruments. HAMMER, for instance, is defined as : &quot;a tool with a heavy head for driving nails into wood or for striking things to break them or move them&quot; (Definition 1). No simple procedure will associate INSTRUMENT with HAMMER. The fact that LDOCE makes use of a defining vocabulary, however, ensures that the defining noun (TOOL in this case) is a member of a finite list, namely the LDOCE defining vocabulary itself. One can go a step further and make the hypothesis that the defining noun will belong to a definite subset within the defining vocabulary. One can go through that vocabulary and select the words that could stand for INSTRUMENTS.
The subset that this procedure yields can fairly easily be divided into two further groups : on the one hand one finds such general words as TOOL and APPARATUS (note that the latter would not be included in a BASIC VOCABULARY) which could also be used in defining formulae; on the other hand one has to include such specific items as BOAT, BICYCLE and GUN, which are instances of instruments. The second group is of course much more problematic than the first : one has to be concerned with TYPICAL instruments, otherwise all PHYSICAL OBJECTS would have to be included : He hit her with the tail of a dead snake.</Paragraph>
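    <Paragraph> The contrast between TELESCOPE and HAMMER can be sketched with two small matching procedures. This is a hedged illustration, not an implemented component : the strict pattern encodes the suggested formula 'instrument used for Ving' literally, while the loose pattern assumes a short list of general defining nouns (TOOL and APPARATUS are the ones named above) and accepts '(used) for Ving' anywhere after the noun. The names match_strict, match_loose and GENERAL_NOUNS are introduced for the example, and the crude '-ing' test would also accept words such as 'something'.

```python
import re
from typing import Optional, Tuple

# The suggested defining formula, taken literally.
STRICT = re.compile(r"\binstrument used for ([a-z]+ing)\b")

# General defining nouns assumed to stand for INSTRUMENT (illustrative subset).
GENERAL_NOUNS = ("instrument", "tool", "apparatus")
LOOSE = re.compile(
    r"\b(" + "|".join(GENERAL_NOUNS) + r")\b.*?\b(?:used )?for ([a-z]+ing)\b"
)

TELESCOPE = ("a tubelike scientific instrument used for seeing distant objects "
             "by making them appear nearer and larger")
HAMMER = ("a tool with a heavy head for driving nails into wood or for "
          "striking things to break them or move them")


def match_strict(definition: str) -> Optional[str]:
    """Return the Ving form if the literal formula occurs in the definition."""
    m = STRICT.search(definition.lower())
    return m.group(1) if m else None


def match_loose(definition: str) -> Optional[Tuple[str, str]]:
    """Return (defining noun, Ving) for any general noun followed by '(used) for Ving'."""
    m = LOOSE.search(definition.lower())
    return (m.group(1), m.group(2)) if m else None


print(match_strict(TELESCOPE), match_strict(HAMMER))
print(match_loose(TELESCOPE), match_loose(HAMMER))
```

    The strict pattern finds the predicate for TELESCOPE but nothing for HAMMER, whereas the loose pattern links HAMMER to 'driving' through the defining noun TOOL; it ignores the object of the Ving form.</Paragraph>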
    <Paragraph position="2"> The INSTRUMENT reading of the 'with'-phrase is not due to any intrinsic property of either 'tail' or 'snake', but rather to four factors : a) WITH often introduces an instrumental adjunct; b) the 'with'-phrase in this sentence cannot be read as postmodifying 'her'; c) it cannot be read as an accompaniment adjunct for 'he' either; d) the predicate 'hit' can take an instrumental adjunct.</Paragraph>
    <Paragraph position="3"> The reader will have noticed that factors a, c and d also apply - mutatis mutandis - to the example involving the predicate SEE. This, however, does not imply that the link between TELESCOPE and SEE was of no use in preferring the instrument reading for the 'with' -phrase - note that 'with a telescope' COULD postmodify the NP heads 'man' and 'park'; besides, even if it could not, we would still have to find a way of telling the system and this task may well prove considerably more formidable than that of associating instruments and predicates.</Paragraph>
    <Paragraph position="4"> The following items in the LDOCE defining vocabulary could be regarded as making up the subset for the  more general and could perhaps be singled out in a third group, intermediate between I and II.</Paragraph>
    <Paragraph position="5"> Obviously, the lists as such are not sufficient for our purpose : words such as SPRING and MEDICINE are not relevant to the INSTRUMENT concept in some of their most frequent uses. For our purposes the defining vocabulary should not have been limited to a list of LEXICAL ITEMS; in the case of polysemic words, numbers should have been added to make clear which definitions were to be associated with the defining word : SPRING 1 (= a source), 2 (= a season), 4 (= elasticity), 5 (= an active healthy quality) and 6 (= an act of springing) are not relevant to the INSTRUMENT concept. Since, in theory, the noun SPRING can be used with all six meanings in LDOCE definitions, its inclusion in our list is liable to prove detrimental : it can lead the system to associate the INSTRUMENT concept with a defining word that has nothing to do with instrumentality. Going back to the LDOCE definition for HAMMER, we realize that the algorithm that will associate instruments and predicates will have to take into account, not only the Ving form (in the formula 'for Ving'), but also its object; otherwise a hammer is going to be thought of as a kind of vehicle. Compare 'a tool ... for driving' (DRIVE 1, 2/3 in LDOCE) with 'a tool ... for driving nails' (DRIVE 1, 5/6 in LDOCE). A second difficulty that we must face up to is that there may be no defining NOUN, but an all-purpose indefinite such as SOMETHING or ANYTHING. In that case, however, the INSTRUMENT concept is likely to be expressed somewhere else in the definitions, by means of (USED) FOR Ving, for instance.
This last point leads us to an examination of the various ways in which the link between instrument and predicate can be conveyed; the existence of a defining vocabulary is a help but the range of SYNTACTIC possibilities remains enormous; however, there is something that could be called the LEXICOGRAPHICAL TRADITION and familiarity with that tradition can help cut down on the number of possible formulae. The following stand a good chance of being rather heavily used :</Paragraph>
  </Section>
  <Section position="10" start_page="374" end_page="374" type="metho">
    <SectionTitle>
USED TO V
(USED) FOR VING
</SectionTitle>
    <Paragraph position="0"> Obviously, processing LDOCE definitions is a lot of work in terms of the necessary algorithms and in terms of the sheer volume of language data to be scrutinized. We suggest that a useful approach is provided by IR (Information Retrieval) techniques as embodied in the IBM system known as STAIRS. STAIRS processes various objects, which can be worked into the following : The various paragraphs of a given document can be assigned labels, so that the search within a single document can be oriented.</Paragraph>
    <Paragraph position="1"> STAIRS provides a number of SEARCH OPERATORS, which will be briefly characterized below. A STAIRS search operator can be used to link any of the following three categories : 1. word tokens, e.g. DISEASES, APPLIES,</Paragraph>
  </Section>
  <Section position="11" start_page="374" end_page="374" type="metho">
    <SectionTitle>
COMPUTERIZED, IF~
</SectionTitle>
    <Paragraph position="0"> 2. a) stems, e.g. RUN-, ANTAGONIZ-, MOTHER- (the use of a character mask enables the system to assign RUNNING, RUNS, RUNNER, RUNNERS, etc. to the stem RUN-) and b) lexemes for which STAIRS generates the morphological variants; 3. any expression consisting of elements of type 1 or 2 linked by STAIRS</Paragraph>
  </Section>
</Paper>