<?xml version="1.0" standalone="yes"?>
<Paper uid="C82-1036">
  <Title>APPROACHES TO THESAURUS PRODUCTION</Title>
  <Section position="3" start_page="0" end_page="227" type="metho">
    <SectionTitle>
I INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Since 1979 we have had available, by contract with LONGMAN Ltd, the computer tape of LDOCE (LONGMAN DICTIONARY OF CONTEMPORARY ENGLISH). Our main concern has been the development of a syntactico-semantic analyzer of general English making full use of all the formatted information contained in our dictionary file (Michiels et al. 1980; Michiels 1982).</Paragraph>
    <Paragraph position="1"> LDOCE is a medium-sized dictionary of core English containing some 60,000 entries which feature the following types of information : a) fully formalized : part of speech (POS) ; grammatical fields, i.e. sets of grammatical codes, which describe the environment that the code-bearing item can or must fit in.</Paragraph>
    <Paragraph position="2"> What makes these grammatical fields particularly suitable for the purposes of machine disambiguation of natural language is that they are assigned to word-senses (definitions) as well as to whole lexical entries. An example is provided by the LDOCE entry CONSIDER (p. 233).</Paragraph>
    <Paragraph position="3"> In the example string I consider you a fool, the two-NP chain (YOU = NP1, A FOOL = NP2) satisfies the [X1] code associated with the second definition of the verb and enables the analyzer to select the appropriate definition in context (&quot;scanning procedures&quot; : cf. Michiels et al. 1980). Definition space, i.e.</Paragraph>
    <Paragraph position="4"> (i) semantic codes : inherent features for nouns, selectional restrictions for adjectives and verbs. Consider the entry HAMMER, verb. As the definition space does not appear in the printed version, we refer the reader to the computer file where, for the third definition, the semantic codes indicate that both the deep subject and the deep object must be H, i.e. [HUMAN].</Paragraph>
    <Paragraph position="5"> (ii) subject codes (and labels)  228 A. MICHIELS and J. NOËL Ex. : In the entry HAMMER, def. 3 is assigned SPXX (Sports) and def. 5 ECZS (EC : Economics, Z : subdivision indicator, S : Stock Exchange and Investment). b) partly formalized : In most dictionaries, definitions are nothing else but strings of natural language, albeit of a special type (Smith and Maxwell 1973; Amsler 1980, p. 108). A first step towards formalizing definitions has been taken by the LDOCE lexicographers : all the LDOCE examples and definitions are written in a controlled defining vocabulary of some 2,100 items (lexemes - e.g. HISTORY - and morphemes - e.g. RE- and -IZATION - no morphological variants). Our concern in this paper will be with how to produce thesauri from dictionary files. What prompts us to examine this problem is the existence of two contrasting approaches to thesaurus-production : the first is exemplified by LOLEX (LONGMAN</Paragraph>
  </Section>
  <Section position="4" start_page="227" end_page="227" type="metho">
    <SectionTitle>
LEXICON OF CONTEMPORARY ENGLISH, 1981), the second by Amsler 1980.
II THESAURUS PRODUCTION
</SectionTitle>
    <Paragraph position="0"> Although LOLEX takes over a subset of the LDOCE definitions, both the choice of thesauric categories (e.g. J.212 verbs : DISMISSING AND RETIRING PEOPLE) and the assignment of a lexical item to one of several categories (e.g. DISBAND assigned to J.212) are based on the lexicographer's intuition and knowledge of previous work in the field (cf. Roget's, etc.).</Paragraph>
    <Paragraph position="1"> Amsler's approach is totally different (see Amsler 1980) : using as data base the computer files of the MPD (Merriam Pocket Dictionary) prepared by John Olney (Olney 1968), he develops an interactive procedure for thesaurus production. The first step is a manual selection and disambiguation of the GENUS TERMS in the definitions of nouns and verbs. By GENUS TERM is to be understood the first word of the definition which has the same POS as the definiendum and can serve as its superordinate. For example, in the first definition of HAMMER, the genus term is STRIKE, whereas in the fifth it is DECLARE.</Paragraph>
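    <Paragraph> The definition of genus term given above suggests a first-pass filter that proposes a candidate for the lexicographer to confirm. The following Python sketch is only an illustration under stated assumptions : the part-of-speech table and the definition strings are hypothetical stand-ins (they are not quoted from LDOCE or MPD), and the function name genus_candidate is invented for the example. It applies the POS identity condition only and does not test whether the word can serve as a superordinate.

```python
# Toy part-of-speech table (hypothetical data, not taken from LDOCE or MPD).
POS = {
    "to": "particle",
    "strike": "verb",
    "with": "preposition",
    "a": "determiner",
    "hammer": "noun",
    "declare": "verb",
    "unable": "adjective",
    "pay": "verb",
    "debts": "noun",
}


def genus_candidate(definition, definiendum_pos, pos_lexicon=POS):
    """Return the first word of the definition whose POS matches the definiendum's.

    Returns None when no word matches. This is only a first guess: it does not
    check that the word can serve as a superordinate of the definiendum.
    """
    for word in definition.lower().split():
        token = word.strip(".,;:()")
        if pos_lexicon.get(token) == definiendum_pos:
            return token
    return None


# Hypothetical definition strings for two verb senses of HAMMER.
print(genus_candidate("to strike with a hammer", "verb"))
print(genus_candidate("to declare unable to pay debts", "verb"))
```

Because the sketch returns the first POS match, it can pick the wrong word when the genus term is embedded deeper in the definition, so its output is a suggestion for manual checking and not a replacement for it.</Paragraph>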
    <Paragraph position="2"> It should be realized that genus term and syntactic head do not always coincide, and this mismatch is a major obstacle in the development of automatic procedures for genus term selection. Contrast in this respect the first and the second homographs of the LDOCE headword BOA (page 105). The second poses no problem : syntactic head and genus term are identical (GARMENT). In the first, however, the genus term is lodged inside the second OF-phrase, itself embedded in the first, which in its turn depends on the syntactic head ANY.</Paragraph>
    <Paragraph position="3"> Once they have been selected, the genus terms are disambiguated with reference to the data base itself by selecting the appropriate homograph and definition numbers. A convenient example, drawn from LDOCE, is the disambiguation of the genus term CONSIDER in the definition of LOOK ON ([X9 esp. as, with] : to consider; regard). CONSIDER here will be disambiguated as CONSIDER (x, 2) (x = non-homographic, 2 = second definition - cf. LDOCE entry CONSIDER, p. 253). The next step is the use of a tree-growing algorithm, which Amsler has programmed and applied to his MPD data base. It is based on a filiation technique between lexical entries and genus terms. We shall illustrate it with respect to the item VEHICLE (x, 1) in our own data base. Descending the filiation path, the procedure will select all the items which use the word VEHICLE (x, 1) as genus term in their definitions. Among these are CAR (x, 1/2/3) and CARRIAGE (x, 1/2/7). CARRIAGE in turn functions as a genus term and yields its own sub-class, which contains, among others, the items BROUGHAM (x, x - non-homographic + a single definition) and GIG (1, 1) - which are themselves defined by means of the genus term CARRIAGE. In our example, the procedure stops at BROUGHAM and GIG because these lexical items are nowhere in the dictionary used as genus terms. It results in a partial</Paragraph>
  </Section>
  <Section position="5" start_page="227" end_page="227" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> taxonomy headed by the item VEHICLE :</Paragraph>
    <Paragraph position="2"> Going up the filiation path from the word-sense VEHICLE (x, 1) one finds as syntactic head the pro-form SOMETHING - there is no genus term. Even if one is prepared to consider SOMETHING as the genus term (relaxing the POS identity condition), the thesauric link that is obtained does not yield more information than the semantic codes associated with the relevant definition.</Paragraph>
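    <Paragraph> The descending filiation procedure illustrated with VEHICLE can be sketched as follows. This is a minimal Python sketch, not Amsler's program : the entry and genus term pairs are those named in the text, with sense numbers dropped for brevity, and the names GENUS_LINKS, build_index and grow_tree are hypothetical.

```python
from collections import defaultdict

# (entry, genus term) pairs from the VEHICLE example in the text.
# Word-sense numbers are omitted to keep the sketch short.
GENUS_LINKS = [
    ("CAR", "VEHICLE"),
    ("CARRIAGE", "VEHICLE"),
    ("BROUGHAM", "CARRIAGE"),
    ("GIG", "CARRIAGE"),
]


def build_index(links):
    """Map each genus term to the entries that use it in their definitions."""
    index = defaultdict(list)
    for entry, genus in links:
        index[genus].append(entry)
    return index


def grow_tree(root, index, seen=None):
    """Descend the filiation path from root and return a nested dict.

    Each entry is expanded at most once, which guards against circular
    definitions in the source data.
    """
    seen = set() if seen is None else seen
    seen.add(root)
    children = {}
    for entry in index.get(root, []):
        if entry not in seen:
            children[entry] = grow_tree(entry, index, seen)
    return children


tree = {"VEHICLE": grow_tree("VEHICLE", build_index(GENUS_LINKS))}
print(tree)
```

Entries such as BROUGHAM and GIG come out with empty sub-trees, which corresponds to the stopping condition : they are not used as genus terms anywhere in the data.</Paragraph>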
    <Paragraph position="3"> A clear advantage of Amsler's procedure over intuitive thesaurus-production (as exemplified in LOLEX) is that it can lead to an improvement of the dictionary data base that is used as source. To take only one example : suppose that one is convinced that there should be a thesauric link (hyponym - superordinate) between VEHICLE and INSTRUMENT. If LDOCE is used as source data base for thesaurus production, the link in question will not be retrieved (INSTRUMENT is not used as genus term in the LDOCE definition of VEHICLE (x, 1)), which inevitably raises the question of whether or not to revise the definition of VEHICLE.</Paragraph>
  </Section>
  <Section position="6" start_page="227" end_page="227" type="metho">
    <SectionTitle>
III EXPLOITING LDOCE DEFINITIONS
</SectionTitle>
    <Paragraph position="0"> Applied to the LDOCE definitions, Amsler's technique reveals an interesting consequence of a controlled defining vocabulary : the thesauric hierarchies are more shallow in LDOCE than in MPD (which does not feature a controlled defining vocabulary). To give an example, LDOCE defines LIMOUSINE by means of the genus term  The shallow hierarchies based on LDOCE definitions are no doubt less revealing for the purpose of thesauric organisation. But the use of a controlled defining vocabulary makes it easier to process dictionary definitions in terms of both : 1) automatizing genus term selection and disambiguation and 2) parsing whole definition strings (as opposed to 1). This is because the lexicon that the parser must have access to can be determined in advance. It is NOT open-ended (open-ended means, practically, as extensive as the defined vocabulary, i.e. the whole list of dictionary entries - cf. Amsler 1980, p. 109).</Paragraph>
    <Paragraph position="1"> Schematically, the decision to use a controlled vocabulary to write dictionary definitions can have three undesirable consequences : 1) .- reduction of the amount of information conveyed by the definition : OVERUSE of implicitly or explicitly partial definitions (in the sense of Bierwisch &amp; Kiefer 1969, p. 66-68) - the latter are incomplete definitions which wear their incompleteness on their sleeve, for example : TARANTULA : spider of a certain kind.</Paragraph>
    <Paragraph position="2"> 2) .- semantic overloading of all-purpose items such as GET, HAVE, MAKE, TAKE, etc. E.g. KEEP (1, 8) : to have for some time or for more time (LDOCE, p. 605) 3) .- uncontrolled increase in syntactic complexity in the differentia (non-genus part of the definition) : a) degree of embedding - not only in clauses, but also - and perhaps more importantly - in complex nominal groups (cf. Amsler 1980, p. 108 on ANT-EATING in the definition of AARDVARK) b) anaphoric relations c) scope relations (conjunction plays a prominent part here). Compare the following two definitions of INSULIN : i) .- OALDOCE (Hornby 1980) - 18 words : substance (a hormone) prepared from the pancreas of sheep, used in the medical treatment of sufferers from diabetes</Paragraph>
    <Paragraph position="4"> ii) .- LDOCE : a substance produced naturally in the body which allows sugar to be used for ENERGY, esp. such a substance taken from sheep to be given to sufferers from a disease (DIABETES) which makes them lack this substance.</Paragraph>
    <Paragraph position="5"> (ENERGY and DIABETES in capital letters because not in LDOCE defining vocabulary).</Paragraph>
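    <Paragraph> Checking a definition against a closed defining vocabulary is mechanical, which is one reason the parser's lexicon can be fixed in advance. The Python sketch below is an illustration resting on assumptions : DEFINING_VOCABULARY is a tiny hypothetical sample that lists inflected forms directly, whereas the LDOCE list contains lexemes and morphemes without morphological variants, so a real check would need a morphological analysis step. The function name outside_vocabulary is likewise invented for the example.

```python
# Tiny hypothetical sample standing in for the controlled defining vocabulary.
# Inflected forms are listed directly to avoid a morphological analysis step.
DEFINING_VOCABULARY = {
    "a", "substance", "produced", "naturally", "in", "the", "body",
    "which", "allows", "sugar", "to", "be", "used", "for",
}


def outside_vocabulary(definition, vocabulary=DEFINING_VOCABULARY):
    """Return the words of a definition that are not in the given vocabulary."""
    words = [w.strip(".,;:()").lower() for w in definition.split()]
    return [w for w in words if w and w not in vocabulary]


# First clause of the LDOCE definition of INSULIN quoted in the text.
fragment = (
    "a substance produced naturally in the body "
    "which allows sugar to be used for ENERGY"
)
print(outside_vocabulary(fragment))
```

Words returned by such a check are the ones a lexicographer would have to print in capital letters or replace by a paraphrase.</Paragraph>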
    <Paragraph position="6"> This third consequence stems from the avoidance of non-defining vocabulary items by means of PARAPHRASE, which displaces the burden towards syntactic elaboration, a point cogently made in Ralph 1980 (p. 117).</Paragraph>
    <Paragraph position="7"> This &quot;grammaticalization&quot; of much of the information conveyed by LDOCE dictionary definitions points to the need to analyse whole definition strings rather than just the genus terms (see the process of ANNOTATING dictionary definitions in Noël et al. 1981).</Paragraph>
    <Paragraph position="8"> Before we consider how to tackle the problem of disambiguating definition strings, we must examine a much easier way of retrieving at least some thesauric links from the LDOCE dictionary file. The LDOCE lexicographers sometimes provide ready-made thesauric links : 1) .- cross-reference to an item belonging to the defining vocabulary : CAPTAIN (2, x) : to be captain of; COMMAND (synonyms) 2) .- cross-reference to a non-defining vocabulary item : ABBEY (x, 1) : ...... ; MONASTERY or CONVENT (synonyms) 3) .- cross-reference to a non-defining vocabulary item inside an LDOCE definition, with a paraphrase in the defining vocabulary. An example is to be found in the LDOCE definition of INSULIN quoted above : disease (DIABETES) which .... (DISEASE : genus term, superordinate of DIABETES). In Noël et al. 1981 and Michiels et al. 1981 we have shown the power of the LDOCE grammatical codes to disambiguate items in context, more specifically in the context provided by the definition strings themselves. For instance, in the LDOCE definition</Paragraph>
  </Section>
  <Section position="7" start_page="227" end_page="227" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> - a wicked person who leads people to do wrong or harms those who are kind to him - the annotating process will select the V3 code for LEADS, because it occurs in the syntactic environment NP + TO + VP (NP = people, VP = do wrong) defined by V3. This assignment enables the system to reject all the word senses for LEAD in LDOCE except the appropriate one (one out of nine; cf. entry LEAD, page 622). We would like here to put forward a further possible exploitation of the LDOCE grammatical codes for the purpose of disambiguating dictionary definitions. It applies to genus terms and consists in the selection of a preferred word-sense for the genus term on the basis of a similarity in grammatical code between definiendum and genus term. Let us turn back to our fourth example, the entry LOOK ON (2, x). The first genus term is CONSIDER. LOOK ON is assigned the grammatical code [X9]. The second definition of CONSIDER is assigned the [X (to be) 1, 7] code. The similarity in grammatical code X serves as criterion to disambiguate CONSIDER in the definition of LOOK ON as CONSIDER (x, 2).</Paragraph>
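    <Paragraph> The code-similarity criterion can be sketched in a few lines. The Python below is a hedged illustration : only the [X9] field of LOOK ON and the [X (to be) 1, 7] field of the second definition of CONSIDER come from the text ; the fields given for the other senses of CONSIDER are placeholders invented for the example, and comparing capital code letters is one simple assumed reading of similarity in grammatical code. The names code_letters and preferred_senses are hypothetical.

```python
import re


def code_letters(field):
    """Return the set of capital code letters found in a grammatical field."""
    return set(re.findall(r"[A-Z]", field))


def preferred_senses(definiendum_field, genus_senses):
    """Return the genus-term sense numbers sharing a code letter with the definiendum."""
    wanted = code_letters(definiendum_field)
    return [
        number
        for number, field in sorted(genus_senses.items())
        if wanted.intersection(code_letters(field))
    ]


# Sense 2 carries the field quoted in the text; senses 1 and 3 are placeholders.
CONSIDER_SENSES = {
    1: "T1, 4",
    2: "X (to be) 1, 7",
    3: "T5",
}

# LOOK ON is assigned X9, so only senses of CONSIDER with an X code survive.
print(preferred_senses("X9", CONSIDER_SENSES))
```

When several senses share the code letter, the list contains more than one candidate and further evidence is needed to choose among them.</Paragraph>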
    <Paragraph position="1"> The LDOCE semantic and subject codes can be exploited in a similar way. It can be hypothesized that the combined use of all the formalized information types in LDOCE will prove to have a high disambiguating power and turn out to be a useful tool for the setting up of thesauric classes.</Paragraph>
    <Paragraph position="2"> A last point that we wish to touch on concerns the nature of the genus terms in a dictionary data base which makes use of a controlled defining vocabulary. The grammaticalization of information due to paraphrase in LDOCE gives rise to a special distribution of genus terms along a FULL WORD - PROFORM gradient.</Paragraph>
  </Section>
  <Section position="8" start_page="227" end_page="227" type="metho">
    <SectionTitle>
FULL WORD
LIQUID SUBSTANCE
ANALYSIS
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="9" start_page="227" end_page="227" type="metho">
    <SectionTitle>
ANYTHING
</SectionTitle>
    <Paragraph position="0"> cf. LDOCE def. of VEHICLE (x, 1)</Paragraph>
  </Section>
  <Section position="10" start_page="227" end_page="227" type="metho">
    <SectionTitle>
PROCESS
ACTION
</SectionTitle>
    <Paragraph position="0"> As compared with MPD, for example, LDOCE genus terms tend to cluster toward the proform end of the gradient. When the point is reached where the genus term does not provide more specific information than the semantic codes assigned to the definiendum, two conclusions can be drawn : 1) .- the lexicographers of the source dictionary must consider whether their definition is appropriate, as it does not show the thesauric links perspicuously; 2) .- the whole definition string must be processed and disambiguated, so as to retrieve the information that a dictionary which does not use a controlled defining vocabulary would have included in the genus term.</Paragraph>
    <Paragraph position="1"> At the same time, the analysis of whole definition strings will reveal a number of thesauric links (such as that between INSTRUMENT and ACTION discussed in Michiels et al. 1980) that the study of genus terms, limited to the HYPONYM/SUPERORDINATE relation, is unable to retrieve.</Paragraph>
  </Section>
</Paper>