File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/90/c90-3025_metho.xml
Size: 25,751 bytes
Last Modified: 2025-10-06 14:12:29
<?xml version="1.0" standalone="yes"?> <Paper uid="C90-3025"> <Title>Is there content in empty heads?*</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Background </SectionTitle> <Paragraph position="0"> Dictionary definitions of nouns are normally written in such a way that one can identify a &quot;genus term&quot; for the headword (the word being defined) via an IS-A relation. The information following the genus term, the differentia, serves to differentiate the headword from other headwords with the same genus. (* This research was supported by the New Mexico State University Computing Research Laboratory through NSF Grant No. IRI-8811108. Grateful acknowledgement is accorded to all the members of the CRL Natural Language Group for their comments and suggestions.)</Paragraph> <Paragraph position="1"> For example (from LDOCE): knife - a blade fixed in a handle, used for cutting as a tool or weapon.</Paragraph> <Paragraph position="2"> Here &quot;blade&quot; is the genus term of the headword &quot;knife&quot; and &quot;fixed in a handle, used for cutting as a tool or weapon&quot; is the differentia. In other words, a &quot;knife&quot; IS-A &quot;blade&quot; (genus) distinguished from other blades by the features of its differentia. In order to create a taxonomy of word senses, this genus term must be identified and also sense-tagged (in this case, by ruling out blade of grass, propeller blade, and an amusing fellow).</Paragraph> <Paragraph position="3"> Previous research on constructing taxonomies from machine readable dictionaries, i.e. Amsler &amp; White (1979) and, to some extent, Chodorow et al.</Paragraph> <Paragraph position="4"> (1985), has relied on a good deal of human intervention whenever the taxonomy is composed of word senses rather than spelling forms. 
Nakamura &amp; Nagao (1988) automatically constructed a taxonomy, but did not distinguish the senses of nouns and hence cannot allow inheritance of properties along the links of the implied network created by the taxonomy.</Paragraph> <Paragraph position="5"> Because of the semantic category markings in LDOCE, we have been able to develop heuristic procedures (described in section 4) that, to a great extent, automate the task of developing a hierarchy of word senses.</Paragraph> <Paragraph position="6"> Constructing taxonomies from the genus terms of definitions forces one to take a stand on how to treat a large class of noun definitions which are not as &quot;standard&quot; as the definition given above for knife. The characteristic property of these definitions is that the head of the first noun phrase (the usual place to find a genus term) seems vacuous, and another easily identifiable noun in the definition gives information about the headword. Nakamura &amp; Nagao (1988) identify these non-standard definitions syntactically as: {det.} {adj.}* &lt;Function Noun&gt; of &lt;Key Noun&gt; {adj. phrase}* For example, the following definitions have the property that the head of the noun phrase following the &quot;of&quot; is more semantically relevant to the headword than the head of the first noun phrase.</Paragraph> <Paragraph position="7"> arum (LDOCE) - a tall, white type of lily; cyclamate (LDOCE) - any of various man-made sweeteners ...</Paragraph> <Paragraph position="8"> deuterium (Merriam-Webster Pocket Dictionary) - a form of hydrogen that is twice the mass of ordinary hydrogen; academic (LDOCE) - a member of a college or university. The form of this type of definition is predictable whenever certain words are used as the head of the first noun phrase. Amsler and White (1979) kept a list of these words, referring to them as partitives and collectives. Nakamura &amp; Nagao (1988) call them Function Nouns. 
Chodorow et al. (1985) refer to a subset of these as &quot;empty heads&quot;. Since we disagree with certain elements of these characterizations, we will use the terminology &quot;disturbed heads&quot;. The question at issue is: what to do with these cases? In the original work of Amsler and White (1979) with the Merriam-Webster Pocket Dictionary (MPD, 1964), the disturbed head cases were handled by asking paid human &quot;disambiguators&quot; to sense-tag the head of the first noun phrase in the definition and also to sense-tag any other noun in the definition which &quot;made a significant semantic contribution to an IS-A link&quot; (Amsler and White, 1979: p. 55) with the headword being defined (i.e. for the deuterium definition above, &quot;hydrogen&quot; was sense-tagged as well as &quot;form&quot;). The taxonomy actually contained both a link from deuterium to &quot;form&quot; and a link from deuterium to &quot;hydrogen&quot;, although the hydrogen sense was marked in a special way to indicate that it is not the syntactic head of the definition. In cases like the &quot;hydrogen&quot; example just given, the marked &quot;semantic contributors&quot; were never given ancestors, since the link often represented a more loosely defined relation than the strictly transitive &quot;is a subset of&quot; definition of IS-A, which ideally relates the headword and its genus sense. This degenerate form of IS-A precludes inheritance in the network. It is included in the taxonomy in order to form links to words which may not be related in a strict IS-A sense, but which convey useful information about the word being defined.</Paragraph> <Paragraph position="9"> There have been various proposals over the years suggesting different specialized link types to be added to the taxonomy (besides the degenerate IS-A).</Paragraph> <Paragraph position="10"> Markowitz et al. (1986) suggest that HAS_MEMBER links be created in definitions which use the phrase &quot;member of&quot; (i.e. 
&quot;college&quot; HAS_MEMBER &quot;academic&quot; in the definition of academic above). Nakamura &amp; Nagao (1988) identify 41 different function nouns and replace the IS-A link in their taxonomy with various other links in these cases (except in the &quot;kind of&quot;, &quot;type of&quot;, etc., definitions). Amsler (1980) suggests the incorporation of an IS_PART_OF link in addition to the IS-A links in the earlier taxonomy of Amsler &amp; White (1979).</Paragraph> <Paragraph position="11"> Chodorow et al. (1985) automate the genus finding process for nouns and verbs in Webster's Seventh (W7, 1967). However, in their work, only the spelling form of the genus is identified automatically; the sense selections are made by humans. The disambiguation here is not to attach a sense number, but rather to perform a function termed &quot;sprouting&quot; which interactively selects among all words which have a given word-sense as a genus. Their taxonomy contains only IS-A links, but they partially attack the &quot;disturbed head&quot; problem by identifying a small class of what they call &quot;empty heads&quot;. The effect of their method is to skip over seemingly vacuous terms (located where a genus is usually expected), and treat the more semantically relevant term as the actual genus.</Paragraph> <Paragraph position="12"> 3. Description of LDOCE and its limitations The Longman Dictionary of Contemporary English (LDOCE; Procter et al. 1978) is a full-sized dictionary designed for learners of English as a second language that contains 41,122 headword entries, defined in terms of 72,177 word senses, in machine-readable form (a type-setting tape). The book and tape versions of LDOCE both use a system of grammatical codes of about 110 syntactic categories which vary in generality from, for example, noun to noun/count to noun/count/followed-by-infinitive-with-TO. 
The machine readable version of LDOCE also contains &quot;box&quot; and &quot;subject&quot; codes that are not found in the book. The box codes use a set of primitives such as abstract, concrete, and animate, organized into a type hierarchy. This hierarchy of primitive types conforms to the classical notion of the IS-A relation as describing proper subsets. These primitives are used to assign type restrictions on nouns and adjectives, and type restrictions on the arguments of verbs. The subject codes are another set of terms organized into a hierarchy. This hierarchy consists of main headings such as engineering with subheadings like electrical. These terms are used to classify words by subject. For example, one sense of current is classified as geology-and-geography while another sense is marked engineering/electrical. This paper's overall goal is to make implicit semantic information in the dictionary explicit. However, we are not doing &quot;psychology of lexicography&quot;: the test of our derived structures is not whether they match any conscious or unconscious inferences of lexicographers, but whether they improve subsequent natural language processing (e.g. machine translation). Nor are we in any way concerned here with low-level issues of the syntax of dictionary entries, its expression on tapes or pages, or by what device the information enters the computer. It is of course a strong assumption that a fallible dictionary designed for human learners of a second language also implicitly contains the information needed for successful natural language processing. We make this assumption consciously as an empirical hypothesis. Even though LDOCE has beneficial features, such as its restricted vocabulary for sense definition, we see no reason to believe at this stage that the taxonomic relations we derive are in any way non-standard.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. 
Automatically finding genus senses </SectionTitle> <Paragraph position="0"> A heuristic procedure that automatically finds disambiguated genus terms for nouns has been developed. The initial stage of this procedure is to automatically identify the genus term in the definition. The Lexicon Provider (Slator 1988a, 1988b; Slator and Wilks, 1987, 1990) mentioned above has a parser which does this. The parser accepts LDOCE definitions as Lisp lists and produces phrase-structure trees. LDOCE sense definitions are typically one or more independent clauses composed of zero or more prepositional phrases, noun phrases, and/or relative clauses. The syntax of sense definitions is relatively uniform, and developing a grammar for the bulk of LDOCE has not proven to be an intractable problem. Chart parsing was selected for this system because of its utility as a grammar testing and development tool. The chart parser is driven by a context-free grammar of 100-plus rules and has a lexicon derived from the 2,219 words in the LDOCE core vocabulary. The parser is left-corner, and bottom-up, with top-down filtering.</Paragraph> <Paragraph position="1"> The context-free grammar driving the chart parser is virtually unaugmented and, with certain minor exceptions, no procedure associates constituents with what they modify. Hence, there is little or no motivation for assigning elaborate or competing syntactic structures, since the choice of one over the other has no semantic consequence. Therefore, the trees are constructed to be as &quot;flat&quot; as possible. The parser also has a &quot;longest string&quot; (fewest constituents) syntactic preference. The grammar is still being tuned, but the chart parser is already quite successful and works extremely well over a fairly wide range of examples from the language of content word definitions in LDOCE. Ninety-five percent of the analyses result in a parse tree for the entire definition text. Five percent of the analyses fail at some point. 
In those cases where it fails, the parser still returns a partial parse (of the leading constituents in the definition text), and this is the most important part of a definition anyway.</Paragraph> <Paragraph position="2"> The second phase of this procedure is to find the correct sense of the genus term that has been identified by the parser. To do this, we have constructed a program called the Genus Disambiguator, which takes as input the subject codes (pragmatic codes) and box codes (semantic category codes) of the headword, taken from the machine readable version of LDOCE, and the spelling form of the genus word which has been identified by the parser described above. The output is the correct sense of the genus word.</Paragraph> <Paragraph position="3"> The codes in LDOCE seem to support the thesis that the genus for a noun must be a noun, and that the semantic category of the genus word must be the same as, or an ancestor of, the semantic category of the headword. The word ancestor refers to superordinate terms in the hierarchy of semantic codes defined by the Longman lexicographers. The strategy of the algorithm is: 1. choose the genus sense whose semantic codes identically match those of the headword, if possible; 2. if not, choose the sense whose semantic category is the closest ancestor to the semantic category of the headword; 3. in the case of a tie, the subject codes are used to determine the winner; 4. if subject codes cannot be used to break the tie, the first one of the tied senses which appears in the dictionary is chosen (since more frequently used senses are listed first in LDOCE). The following examples illustrate the algorithm. The ordered pair following the headword consists of the box code and subject code as found in the dictionary (the notation following that is the English gloss for these particular codes). Many definitions are not given a subject code in LDOCE and a dash (--) is used here to indicate that. 
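The four-step strategy above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the Genus Disambiguator program itself: the names Sense, PARENT, ancestors, and disambiguate are hypothetical, the hierarchy fragment contains only the single link (movable-solid J directly below solid S) that this paper mentions, and comparing subject codes by exact equality is a simplifying assumption.

```python
from typing import NamedTuple, Optional


class Sense(NamedTuple):
    number: int             # sense number, in dictionary order
    box: str                # box (semantic category) code
    subject: Optional[str]  # subject code, or None when none is given (--)


# Hypothetical toy fragment of the box-code hierarchy (child -> parent).
# Only the J -> S link is taken from the paper; the real hierarchy is larger.
PARENT = {"J": "S"}


def ancestors(code):
    """Yield the code itself, then each superordinate code, nearest first."""
    while code is not None:
        yield code
        code = PARENT.get(code)


def disambiguate(head_box, head_subject, senses):
    """Pick a genus sense for a headword with the given box and subject codes."""
    # Steps 1 and 2: exact box-code match first, otherwise the closest ancestor.
    candidates = []
    for code in ancestors(head_box):
        candidates = [s for s in senses if s.box == code]
        if candidates:
            break
    if not candidates:
        # No sense is compatible: all are equally bad, so all stay in play.
        candidates = list(senses)
    # Step 3: break ties with the subject code (exact equality is an assumption).
    if len(candidates) > 1 and head_subject is not None:
        by_subject = [s for s in candidates if s.subject == head_subject]
        if by_subject:
            candidates = by_subject
    # Step 4: default to the earliest-listed sense.
    return min(candidates, key=lambda s: s.number)


# Input (ambulance J AUZV vehicle), using the two fully coded senses of vehicle.
vehicle = [Sense(1, "J", "TNVH"), Sense(2, "T", None)]
print(disambiguate("J", "AUZV", vehicle).number)  # prints 1
```

Walking the ancestor chain from the headword's own code folds steps 1 and 2 into one loop, since an exact match is simply the nearest possible "ancestor".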
Consider the following LDOCE definition.</Paragraph> <Paragraph position="4"> ambulance - (J:movable-solid, AUZV: Automotive/Vehicle-Types) - motor vehicle for carrying sick or wounded people esp.</Paragraph> <Paragraph position="5"> to hospital The genus of ambulance is the word &quot;vehicle&quot;, which is found by the Lexicon Provider's parser; therefore the input to the Genus Disambiguator is: (ambulance J AUZV vehicle) The following are the LDOCE definitions for the noun senses of vehicle. vehicle-1 - (J:movable-solid, TNVH: Transportation/Vehicles) - something in or on which people or goods can be carried from one place to another ...</Paragraph> <Paragraph position="6"> vehicle-2 - (T:abstract,--) - something by means of which something else can be passed on or spread: Television has become an important vehicle for spreading political ideas</Paragraph> <Paragraph position="8"> ... showing off a person's abilities: The writer wrote this big part in his play simply as a vehicle for the famous actress In this case the Genus Disambiguator chooses the first sense of vehicle, because of the match between the &quot;movable-solid&quot; semantic codes; therefore the output is &quot;vehicle-1&quot;. There are many cases, however, where a direct match is not found. 
Consider the following LDOCE definition.</Paragraph> <Paragraph position="9"> dart - (J:movable-solid,GA:Games) - a small sharp-pointed object to be thrown, shot, etc. ...</Paragraph> <Paragraph position="10"> The word &quot;object&quot; is the genus of dart, making the input to the Genus Disambiguator (dart J GA object) The following are the LDOCE noun definitions for object: ...</Paragraph> <Paragraph position="12"> or someone that produces interest or other effect ...</Paragraph> <Paragraph position="13"> object-3 - (l:human-and-solid,--) - something or someone unusual or that causes</Paragraph> <Paragraph position="15"> or with what, a PREPOSITION ...</Paragraph> <Paragraph position="16"> In this example there is no direct match between the semantic codes of the headword, dart, and any of the senses of the genus, &quot;object&quot;; therefore the Genus Disambiguator must traverse up the type hierarchy, described in section 3, to find the closest ancestor of box code &quot;J&quot; (movable-solid) that is present in the definitions of the genus word. In this case, box code &quot;S&quot; (solid) is found one level above &quot;J&quot; and the output is &quot;object-1&quot;. There are still other cases, however, when more than one sense definition has semantic codes matching the codes of the headword. Consider the following LDOCE definition.</Paragraph> <Paragraph position="17"> flute - (J:movable-solid,MU:Music) - a pipelike wooden or metal musical instrument with finger holes, played by blowing across a hole in the side ...</Paragraph> <Paragraph position="18"> The genus of flute is the word &quot;instrument&quot;; therefore, the input to the Genus Disambiguator is (flute J MU instrument) The following are the LDOCE definitions for instrument. instrument-1 - (J:movable-solid, HWZT: Hardware/Tools) - an object used to help in work: medical instruments</Paragraph> <Paragraph position="20"> ... 
an object which is played to give musical sounds (such as a piano, a horn,</Paragraph> <Paragraph position="22"> something which seems to be used by an outside force to cause something to happen: an instrument of fate In this case both the first and second senses of instrument are marked as &quot;J&quot; (movable-solid), which matches perfectly with the selection restriction for flute. However, the tie is broken by appeal to the subject code, Music, which selects the second sense of instrument as the genus of flute, and the output is &quot;instrument-2&quot;.</Paragraph> <Paragraph position="23"> There are occasional failures, many of which appear to be due to unusual markings in LDOCE.</Paragraph> <Paragraph position="24"> For example, the LDOCE definition for banana is: banana - (P:plant,PMZ5:Plant-Names) - any of several types of long curved tropical fruit, shaped like a thick finger, with a yellow skin and a soft, usu. sweet, inside ...</Paragraph> <Paragraph position="25"> The genus of banana is the word &quot;fruit&quot;, and the input to the Genus Disambiguator is (banana P PM fruit) The following are the LDOCE definitions for fruit.</Paragraph> <Paragraph position="27"> that grows on a tree or bush, contains seeds, is used for food, but is not usu.</Paragraph> <Paragraph position="28"> eaten with meat or with salt</Paragraph> <Paragraph position="30"> general, esp. considered as food ...</Paragraph> <Paragraph position="31"> phr. old fruit) In this case, banana is marked as a &quot;plant&quot; but, for some reason, the likely candidates defined under fruit are all marked &quot;solid&quot; or &quot;movable-solid&quot;. 
Since neither solid nor movable-solid is an ancestor of plant in the LDOCE type hierarchy, they are all equally bad from the point of view of the Genus Disambiguator, and the default is invoked, which is to choose the lowest numbered sense from among the competitors.</Paragraph> <Paragraph position="32"> Therefore the first sense is selected and the output is &quot;fruit-1&quot;. This happens to be correct, but it is an unsatisfying resolution.</Paragraph> <Paragraph position="33"> In a piece of related work, Slator (1988a) has implemented a scheme in the Lexicon Provider which imposes deeper structure onto the LDOCE subject hierarchy (e.g. terms like Food, Botany, and Plant-Names in the &quot;fruit&quot; definitions above), relating these categories in a natural way, in order to discover important relationships between concepts within text. This manual restructuring simply observes that words classified under Botany have pragmatic connections to words classified as Plant-Names, as well as connections with other words classified under Science (connections not made by the LDOCE hierarchy as given), and that these connections are useful to exploit.</Paragraph> <Paragraph position="34"> The Lexicon Provider system relates these codes through a specially restructured hierarchy created for that purpose, making Communication, Economics, Entertainment, Household, Politics, Science, and Transportation the fundamental categories. Every word sense defined with a subject code therefore has a position in the new hierarchy, attached below the node for its subject code. Once this feature is implemented in the Genus Disambiguator, the subject code hierarchy can be used to resolve the &quot;banana-fruit&quot; case above in a somewhat more satisfactory way, by choosing sense 4 of fruit.</Paragraph> <Paragraph position="35"> 5. 
Identifying other relationships automatically The identification of a satisfactory genus term and the construction of a taxonomy is not straightforward in all cases. It is clear that the problems in this area are difficult, numerous, and can be seen to encompass a great variety of relationships. We believe that a thorough study of this shadowy area is necessary in order to make optimal use of the semantic information available in machine readable dictionaries. Although we do not have complete solutions, we have additional insights into the problem of extracting supplementary information from the &quot;disturbed head&quot; definitions.</Paragraph> <Paragraph position="36"> Chodorow et al. (1985) examined a phenomenon that they described as follows: &quot;If the word found belongs to a small class of &quot;empty heads&quot; (words like one, any, kind, class, manner, family, race, group, complex, etc.) and is followed by of, then the string following of is reprocessed in an effort to locate additional heads.&quot; (pg. 301).</Paragraph> <Paragraph position="37"> Although the empty head rule seems to be a reasonable one in certain situations, we have reservations about its use. The empty head rule produces undesirable effects in an IS-A hierarchy for some of the collective words (that Chodorow et al. treat as empty): set, group, class, etc. Our response to the empty head phenomenon is to process such heads in the same way, but to limit this processing to a much smaller set; that is, to those heads that are truly empty -- the set containing {one, any, kind, type}.</Paragraph> <Paragraph position="38"> Consider the LDOCE definition: canteen - (British English) a set of knives, forks and spoons, usu. 
for 6 or 12 people Since &quot;set&quot; is one of the empty heads for Chodorow et al., their procedure would create IS-A links to &quot;knives&quot;, &quot;forks&quot; and &quot;spoons&quot;, and this again would violate the inheritance properties that should be preserved via IS-A links. Our response to the collective heads, {set, group, collection, class of, family of} (which we maintain are not truly empty, simply disturbed) is to form a taxonomic link to the correct sense of &quot;set&quot;, &quot;group&quot;, or &quot;class&quot;, etc., and to form a HAS_MEMBER link to the noun or nouns which describe the elements of the collective (as found in the differentia of the headword definition). Further, we propose that definitions in which the genus term is plural be treated in the same way as those which begin with &quot;a set of&quot;.</Paragraph> <Paragraph position="39"> In general, our view is that the disturbed heads should be grouped in the sense of Nakamura &amp; Nagao (1988), and that additional links (like HAS_MEMBER, IS_PART_OF, etc.) should be created whenever they are appropriate. However, it is our position that IS-A links should also be created for every word sense given in the dictionary. Moreover, in order to maintain inheritance and transitivity in the IS-A network, a strict &quot;subset of&quot; definition of IS-A should be maintained.</Paragraph> <Paragraph position="40"> Unlike Nakamura &amp; Nagao (1988), we propose that &quot;member of&quot; definitions should not be grouped with the &quot;set of&quot;, &quot;group of&quot; definitions. All but one &quot;member of&quot; definition in LDOCE uses &quot;member of&quot; to mean &quot;person who is a member of&quot;. We recommend that in this case, a link be created from the headword to &quot;person&quot;, and that the appropriate MEMBER-OF link be constructed. 
The exceptional case, where &quot;member of&quot; does not refer to a person, is in the definition of feline: &quot;a member of the cat family.&quot; This case must be treated separately, since it is impossible to identify the correct sense of the word &quot;member&quot; here, given that all these senses, in LDOCE, are marked as referring to a human or a part of the human body.</Paragraph> <Paragraph position="41"> The difficulty of these many varieties of special cases (and they are not so special, since there are hundreds of them in the dictionary) is that they call into question certain of the long held assumptions about the taxonomic structure of dictionaries. The conventional wisdom has always been that dictionary definitions contained a genus term (a term more general than the one being defined), and that this term could almost invariably be found in the first phrase of the definition text. Further, the exceptions to this convention, the &quot;empty heads&quot; like &quot;one of&quot; or &quot;any of&quot;, have been viewed as being similarly well-behaved. Our investigations lead us to conclude that things are not so simple as they once appeared; and the question of what to do with these troublesome cases is far from resolved.</Paragraph> </Section> </Paper>