<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-2002">
  <Title>An Application of Lexical Semantics to Knowledge Acquisition from Corpora</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Keywords: Knowledge Acquisition, Information Retrieval
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The proliferation of on-line textual information has intensified the search for efficient automated indexing</Paragraph>
    <Paragraph position="1"> and retrieval techniques. Full-text indexing, in which all the content words in a document are used as keywords, is one of the most promising of recent automated approaches, yet its mediocre precision and recall characteristics indicate that there is much room for improvement \[Croft, 1989\]. The use of domain knowledge can enhance the effectiveness of a full-text system by providing related terms that can be used to broaden, narrow, or refocus a query at retrieval time (\[Thompson and Croft 1989\], \[Anick et al, 1989\], \[Debili et al, 1988\]). Likewise, domain knowledge may be applied at indexing time to do word sense disambiguation \[Krovetz &amp; Croft, 1989\] or content analysis \[Jacobs, 1989\]. Unfortunately, for many domains, such knowledge, even in the form of a thesaurus, is either not available or is incomplete with respect to the vocabulary of the texts indexed.</Paragraph>
    <Paragraph position="2"> The tradition in both AI and Library Science has been to hand-craft domain knowledge, but the current availability of machine-readable dictionaries and large text corpora presents the possibility of deriving at least some domain knowledge via automated procedures \[Amsler, 1980\] \[Maarek and Smadja, 1989\] \[Wilks et al, 1988\]. The approach described in this paper outlines one such experiment.</Paragraph>
    <Paragraph position="3"> We start with: (1) a lexicon containing morphosyntactic information for approximately 20,000 common English words; (2) encodings of English morphological paradigms and a morphological analyzer capable of producing potential citation forms from inflected forms; (3) a bottom-up parser for recognizing sub-sentential phrasal constructions; and (4) a theory of lexical semantics embodying a collection of powerful semantic principles and their syntactic realizations. The aim of our research is to discover what kinds of knowledge can be reliably acquired through the use of these methods, exploiting, as they do, general linguistic knowledge rather than domain knowledge. In this respect, our program is similar to Zernik's (1989) work on extracting verb semantics from corpora using lexical categories. Our research, however, differs in two respects: first, we employ a more expressive lexical semantics for encoding lexical knowledge; and secondly, our focus is on nominals, for both pragmatic and theoretical reasons. For full-text information retrieval, information about nominals is paramount, as most queries tend to be expressed as conjunctions of nouns. From our theoretical perspective, we believe that the contribution of the lexical semantics of nominals to the overall structure of the lexicon has been somewhat neglected (relative to that of verbs) \[Pustejovsky and Anick, 1988\], \[Pustejovsky 1989\].</Paragraph>
    <Paragraph position="4"> Indeed, whereas Zernik (1989) presents metonymy as a potential obstacle to effective corpus analysis, we believe that the existence of motivated metonymic structures provides valuable clues for semantic analysis of nouns in a corpus.</Paragraph>
    <Paragraph position="5"> Our current work attempts to acquire the following kinds of lexical information without domain knowledge:
o Part of speech and morphological paradigms for new words and new uses of old words;
o Bracketing of noun compounds;
o Subclass relations between nouns;
o Lexical semantic categorization of nouns;
o Clustering of verbs into semantic classes based on the collections of nouns they predicate.</Paragraph>
    <Paragraph position="6"> While such information is still inadequate for natural language &amp;quot;understanding&amp;quot; systems, it vastly simplifies the task of knowledge engineering, should one desire to hand-code lexical items. Furthermore, such information can be put to use directly in full-text</Paragraph>
    <Paragraph position="8"> information retrieval systems, fulfilling some of the roles typically played by thesauri and faceted classifications \[Vickery, 1975\].</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A Framework for Lexical Semantics
</SectionTitle>
    <Paragraph position="0"> The framework for lexical knowledge we will be assuming is that developed by Pustejovsky (1989), who proposes a theory of lexical semantics which explores the internal structure of lexical items from a computational perspective. In this theory, lexical and conceptual decomposition is performed generatively.</Paragraph>
    <Paragraph position="1"> That is, rather than assuming a fixed set of primitives, we assume a fixed set of rules of composition and generative devices. Thus, just as a formal language is described more in terms of the productions of the grammar than in terms of its accompanying vocabulary, a semantic language should be defined by the rules generating the structures for expressions, rather than the vocabulary of primitives itself. For this reason, a dictionary of lexical items and the concepts they derive can be viewed as a generative lexicon.1 Such a theory of lexical meaning specifies both a general methodology and a specific language for expressing the semantic content of lexical items in natural language. The aspect of this theory most relevant to our own concerns is a language for structuring the semantics of nominals. Pustejovsky (1989) calls this the Qualia Structure of a noun, which is essentially a structured representation similar to a verb's argument structure. This structure specifies four aspects of a noun's meaning: its constituent parts; its formal structure; its purpose and function (i.e. its Telic role); and how it comes about (i.e. its Agentive role).</Paragraph>
    <Paragraph position="2"> For example, book might be represented as containing the following information:</Paragraph>
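As a rough, non-authoritative sketch of how a qualia-structure entry of this kind might be encoded in code: the class and the specific role fillers for book below are our assumptions, modelled on the tape(*x*, *y*) entry given later in the paper, not the authors' notation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QualiaStructure:
    """Four-role qualia structure for a nominal (after Pustejovsky 1989)."""
    noun: str
    constitutive: List[str] = field(default_factory=list)  # constituent parts
    formal: List[str] = field(default_factory=list)        # what kind of object it is
    telic: List[str] = field(default_factory=list)         # purpose and function
    agentive: List[str] = field(default_factory=list)      # how it comes about

# Hypothetical fillers for "book" as an information container;
# the predicates are assumptions for illustration only.
book = QualiaStructure(
    noun="book",
    constitutive=["information(*y*)"],
    formal=["phys-object(*x*)"],
    telic=["read(T,w,*y*)"],
    agentive=["artifact(*x*)", "write(T,v,*y*)"],
)
print(book)
```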
    <Paragraph position="4"> This permits us to use the same lexical representation in very different contexts, where the word seems to refer to different qualia of the noun's meaning. For example, the sentences in (1)-(3) refer to different aspects (or qualia) of the general meaning of book.</Paragraph>
    <Paragraph position="5"> Sentence (1) makes reference to the Formal role, while sentence (3) refers to the Constitutive role. Example (2), however, can refer to either the Telic or the Agentive aspects given above. The utility of such knowledge for information retrieval is readily apparent. This theory claims that noun meanings should make reference to related concepts and the relations into which they enter. (1 For elaboration on this idea and how it applies to various lexical classes, see Pustejovsky (forthcoming).) The qualia structure, thus, can</Paragraph>
    <Paragraph position="6"> be viewed as a kind of generic template for structuring this knowledge.</Paragraph>
    <Paragraph position="7"> To further illustrate how objects cluster according to these dimensions, we will briefly consider three object types: (1) containers (of information), e.g. book, tape, record; (2) instruments, e.g. gun, hammer, paintbrush; and (3) figure-ground objects, e.g. door, room, fireplace. Because of how their qualia structures differ, these classes appear in vastly different grammatical contexts.</Paragraph>
    <Paragraph position="8"> As with containers in general, information containers permit metonymic extensions between the container and the material contained within it. Collocations such as those in (4) through (7) indicate that this metonymy is grammaticalized through specific and systematic head-PP constructions.</Paragraph>
    <Paragraph position="9"> 4: read a book
5: read a story in a book
6: read a tape
7: read the information on the tape
Instruments, on the other hand, display classic agent-instrument causative alternations, such as those in (8) through (11).</Paragraph>
    <Paragraph position="10"> 8: ... smash the vase with the hammer
9: The hammer smashed the vase.</Paragraph>
    <Paragraph position="11"> 10: ... kill him with a gun
11: The gun killed him.</Paragraph>
    <Paragraph position="12"> Finally, figure-ground nominals permit perspective shifts such as those in (12) through (15).2
12: John painted the door.</Paragraph>
    <Paragraph position="13"> 13: John walked through the door.</Paragraph>
    <Paragraph position="14"> 14: John is scrubbing the fireplace.</Paragraph>
    <Paragraph position="15"> 15: The smoke filled the fireplace.</Paragraph>
    <Paragraph position="16"> That is, paint and scrub are actions on physical objects while walk through and fill are processes in spaces. These collocational patterns, we argue, are systematically predictable from the lexical semantics of the noun, and we term such sets of collocated phrases collocational systems.3 To make this point clearer, let us consider a specific example of a collocational system. Because of the particular metonymy observed for a noun like tape, we will classify it as a 'container.' In terms of the semantic representation presented here, we can view it as a relational noun, with the following qualia structure:
tape(*x*, *y*)
\[Const: information(*y*)\]
\[Form: phys-object(*x*)\]
\[Telic: hold(S,*x*,*y*)\]
\[Agent: artifact(*x*), write(T,w,*y*)\]
This simply states that any semantics for tape must logically make reference to the object itself (F),</Paragraph>
    <Paragraph position="17"> 2 See Pustejovsky and Anick (1988) for details. 3 This relates to Mel'čuk's lexical functions and the syntactic structures they associate with an element. See Mel'čuk (1988) and references therein. Cruse (1986) discusses the foregrounding and backgrounding of information with respect to similar examples.</Paragraph>
    <Paragraph position="18"> what it can contain (C), what purpose it serves (T), and how it arises (A). This provides us with a semantic representation which can capture the multiple perspectives which a single lexical item may assume in different contexts. Yet, the qualia for a lexical item such as tape are not isolated values for that one word, but are integrated into a global knowledge base indicating how these senses relate to other lexical items and their senses. This is the contribution of inheritance and the hierarchical structuring of knowledge (e.g. \[Brachman and Schmolze 1985\] and \[Bobrow and Winograd 1977\]). In Pustejovsky (1989), it is suggested that there are two types of relational structures for lexical knowledge: a fixed inheritance similar to that of an ISA hierarchy (cf. Touretsky (1986))4; and a dynamic structure which operates generatively from the qualia structure of a lexical item to create a relational structure for ad hoc categories.</Paragraph>
    <Paragraph position="19"> Let us suppose then, that in addition to the fixed relational structures, our semantics allows us to dynamically create arbitrary concepts through the application of certain transformations to lexical meanings. For example, for any predicate Q (e.g. the value of a qualia role) we can generate its opposition, ¬Q. By relating these two predicates temporally we can generate the arbitrary transition events for this opposition. Similarly, by operating over other qualia role values we can generate semantically related concepts. The set of transformations includes: ¬ (negation), < (temporal precedence), > (temporal succession), = (temporal equivalence), and act, an operator adding agency to an argument.</Paragraph>
    <Paragraph position="20"> Intuitively, the space of concepts traversed by the application of such operators will be related expressions in the neighborhood of the original lexical item. We will call this the Projective Conclusion Space of a specific quale for a lexical item.5 To return to the example of tape above, the predicates read and copy are related to the Telic value by just such an operation. Predicates such as mount and dismount, however, are related to the Formal role since they refer to the tape as a physical object alone.</Paragraph>
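A minimal sketch of how these transformations might be applied mechanically to a single quale value follows; the function, the string encoding of predicates, and the operator spellings are illustrative assumptions rather than the authors' implementation.

```python
# Sketch: derive "neighborhood" concepts from one quale value by applying the
# transformations named in the text: negation, temporal precedence, temporal
# succession, temporal equivalence, and the act operator.
def projective_conclusions(quale_value: str) -> dict:
    return {
        "negation":    f"not({quale_value})",         # the opposition, not-Q
        "precedence":  f"before({quale_value})",      # state preceding Q (<)
        "succession":  f"after({quale_value})",       # state following Q (>)
        "equivalence": f"while({quale_value})",       # state temporally equal to Q (=)
        "act":         f"act(agent, {quale_value})",  # add agency to an argument
    }

# Telic value of tape; predicates such as read and copy would lie in this
# neighborhood of related expressions.
for name, concept in projective_conclusions("hold(S,*x*,*y*)").items():
    print(name, "->", concept)
```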
    <Paragraph position="21"> It is our view that the approach outlined above for representing lexical knowledge can be put to use in the service of information retrieval tasks. On the one hand, the projective conclusion space, with its structured assembly of terms clustered about a nominal entity, can serve as a &amp;quot;virtual script&amp;quot;, capable of homonym disambiguation (\[Krovetz 1990\], \[Cullingford and Pazzani 1984\]) and query reformulation. On the other hand, the qualia structure captures the inherent polysemy of many nouns. In the latter respect, our proposal can be compared to attempts at object classification in information science. One approach, known as &amp;quot;faceted classification&amp;quot; (Vickery (1975)), proceeds roughly as follows. Collect terms lying within a field. Then, group the terms into facets by assigning them to categories. Typical examples of this are state, property, reaction, device. However, each subject area is likely to have its own sets of categories, making it difficult to re-use a set of facet classifications in another domain.6 Even if the relational information provided by the qualia structure and inheritance would improve performance in information retrieval tasks, one problem still remains; namely, that it would be very time-consuming to hand-code such structures for all nouns in a domain. Since it is our belief that such representations are generic structures across all domains, our long term goal is to develop methods for how these relations and values can be automatically extracted from on-line corpora. In the section that follows, we describe one such experiment, which indicates that the qualia structures do, in fact, correlate with collocational systems, thereby allowing us to perform structure-matching operations over corpora to find these relations.
4 Thesaurus-like structures are similar within the IR community, cf. \[National Library and Information Asso-</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A Knowledge Acquisition Procedure
</SectionTitle>
    <Paragraph position="0"> In this section, we outline our procedure for knowledge acquisition, implemented as part of the LINKS Lexicon/Corpus Management System.7 Steps are illustrated with examples drawn from an analysis done on a Digital Equipment Corporation on-line corpus of 3000 articles containing VMS troubleshooting information. Briefly, the procedure consists of the following steps.</Paragraph>
    <Paragraph position="1"> 1. Assign morphological paradigms to words in the corpus.</Paragraph>
    <Paragraph position="2"> 2. Generate a set of bracketed noun compounds, e.g. \[TK50 \[tape drive\]\], \[\[database management\] system\].</Paragraph>
    <Paragraph position="3"> 3. Collect Noun Phrases related by prepositions from the collocational systems for the desired lexical items, e.g. &amp;quot;file on tape&amp;quot;, &amp;quot;format of tape&amp;quot;.</Paragraph>
    <Paragraph position="4"> 4. Hypothesize subclass relationships on the basis of collocational information: e.g. if X and Y are nouns and the phrase X Y appears in the corpus, and there is no phrase Y Prep X, then ISA(X,Y). For example, from \[TK50 \[tape drive\]\] we can predict that ISA(TK50, tape drive). However, the potential prediction from &amp;quot;tape drive&amp;quot; that ISA(tape, drive) is blocked by the existence of phrases like &amp;quot;tape in drive&amp;quot; (a sketch of this heuristic appears after step 7 below).</Paragraph>
    <Paragraph position="5"> 5. Seek distributional verification of subclass relationships. For each subclass so generated, seek distributional evidence to support the hypothesis. That is, is there a &quot;substantial&quot; inter-\[...\]
6. \[...\] known lexical category. Try to match the set of syntactic constructions within which X appears with one of our diagnostic construction sets. This may involve searching for the set of constructions that contain nouns in other argument positions of the original set of constructions. For example, the set of expressions involving the word &quot;tape&quot; in the context of its use as a secondary storage device suggests that it fits the container artifact schema of the qualia structure, with &quot;information&quot; and &quot;file&quot; as its containees:
(a) read information from tape
(b) write file to tape
(c) read information on tape
(d) read tape
(e) write tape</Paragraph>
    <Paragraph position="6"> 7. Use heuristics to cluster predicates that relate to the Telic quale of the noun. For example, the word &quot;tape&quot; is the object of 34 verbs in our corpus: (require use unload replace mount restore time request control position dismount allocate off initialize satisfy contain create encounter get allow try leave be load read write have cause protect up perform enforce copy). Among these verbs are some that refer to the formal quale: mount, dismount; and some which refer to tape in its function as an information container: read, write, and copy. One of the ways to tease these sets apart is to take advantage of the linguistic rule that allows a container to be referred to in place of the containee, i.e. the container can be used metonymically. The verbs which have &quot;information&quot; (previously identified as a likely &quot;containee&quot; for tape) as an object in the corpus are: (check include display enter compare list find get extract set be write fit contain read recreate update return provide specify see open publish give insert have copy take relay lose gather). When we intersect the verb sets for &quot;information&quot; and &quot;tape&quot;, we get a set that reflects the predicates appropriate to the telic role of tape, a container of information (plus several empty verbs): (copy have read contain write be get). Thus, the metonymy between container and containee allows us to use set intersection to discriminate among predicates referring to the telic vs. formal roles of the container.</Paragraph>
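Steps 4 and 7 lend themselves to a compact illustration. The sketch below is an assumption about how such heuristics could be coded (the function name and data structures are ours, not the LINKS system's); the verb lists are copied from the text, and the intersection reproduces the telic candidate set quoted above.

```python
def hypothesize_isa(compounds, prep_collocations):
    """Step 4: propose ISA(X, Y) from a compound "X Y", blocking the hypothesis
    when X and Y also co-occur in a Noun-Prep-Noun collocation
    (e.g. "tape in drive" blocks ISA(tape, drive))."""
    related_by_prep = {frozenset((a, b)) for (a, _prep, b) in prep_collocations}
    return [("ISA", x, y) for (x, y) in compounds
            if frozenset((x, y)) not in related_by_prep]

# Step 7: intersect the verbs governing the container ("tape") with those
# governing its containee ("information") to approximate the telic predicates.
tape_verbs = set("""require use unload replace mount restore time request control
    position dismount allocate off initialize satisfy contain create encounter get
    allow try leave be load read write have cause protect up perform enforce copy""".split())
information_verbs = set("""check include display enter compare list find get extract
    set be write fit contain read recreate update return provide specify see open
    publish give insert have copy take relay lose gather""".split())

print(hypothesize_isa([("TK50", "tape drive"), ("tape", "drive")],
                      [("tape", "in", "drive"), ("file", "on", "tape")]))
# -> [('ISA', 'TK50', 'tape drive')]  (ISA(tape, drive) is blocked by "tape in drive")
print(sorted(tape_verbs & information_verbs))
# -> ['be', 'contain', 'copy', 'get', 'have', 'read', 'write']
```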
    <Paragraph position="7"> What results from this acquisition procedure is a kind of minimal faceted analysis for the noun tape, as illustrated below.</Paragraph>
    <Paragraph position="8"> tape(*x*, *y*)
\[Const: information(*y*), file(*y*)\]
\[Form: mount(w,*x*), dismount(w,*x*)\]
\[Telic: read(T,z,*y*), write(T,z,*y*), copy(T,z,*y*)\]
\[Agent: artifact(*x*)\]
To illustrate this procedure on another semantic category, consider the term &quot;mouse&quot; in its computer artifact sense. In our corpus, it appears in the object position of the verb &quot;use&quot; in a &quot;use-to&quot; construction, as well as the object of the preposition &quot;with&quot; following a transitive verb and its object:
(a) use the mouse to set breakpoints
(b) use the mouse anywhere
(c) move a window with the mouse
(d) click on it with the mouse ...</Paragraph>
    <Paragraph position="9"> These constructions are symptomatic of its role as an instrument, and the VP complement of &quot;to&quot; as well as the VP dominating the &quot;with&quot; PPs identify the telic predicates for the noun. Other verbs for which &quot;mouse&quot; appears as a direct object are currently defaulted into the formal role, resulting in an entry for &quot;mouse&quot; as follows:
mouse(*x*)
Thus, by bringing together the automatic construction of collocational systems with a notion of qualia structure for nouns, we have arrived at a fairly useful lexical representation for Information Retrieval tasks.</Paragraph>
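A sketch of this instrument diagnostic follows; the surface patterns and the simple string matching are illustrative assumptions rather than the system's actual grammar. Telic predicates are harvested from "use N to VP" and "V NP with N" contexts, and remaining direct-object verbs default to the formal role.

```python
import re

def classify_instrument_contexts(phrases, noun="mouse"):
    """Split observed contexts for an instrument noun into telic predicates
    (from "use <noun> to VP" and "V ... with <noun>") and formal-role defaults."""
    telic, formal = set(), set()
    for p in phrases:
        m = re.search(rf"use the {noun} to (\w+)", p)    # use the mouse to set ...
        if m:
            telic.add(m.group(1))
            continue
        m = re.search(rf"(\w+) .*with the {noun}", p)    # move a window with the mouse
        if m:
            telic.add(m.group(1))
            continue
        m = re.search(rf"(\w+) the {noun}", p)           # other V + mouse: formal default
        if m:
            formal.add(m.group(1))
    return {"Telic": sorted(telic), "Form": sorted(formal)}

print(classify_instrument_contexts([
    "use the mouse to set breakpoints",
    "use the mouse anywhere",
    "move a window with the mouse",
    "click on it with the mouse",
]))
# {'Telic': ['click', 'move', 'set'], 'Form': ['use']}
```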
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> Previous investigators involved in corpus analysis using weak methods have documented limited successes and warned of many pitfalls (e.g.</Paragraph>
    <Paragraph position="1"> \[Grishman et al 1986\] and \[Zernik 1989\]). The approach described here differs from previous efforts in its combination of diagnostic collocational systems with a generic target representation for nouns. While our limited experiments with the acquisition algorithm show some promise, it is too early to tell how well this approach will do in a larger corpus containing a greater range of senses for terms. One danger is for the algorithm to be overly optimistic in matching a set of occurrences to a diagnosis. Given the rampant ambiguity of prepositions and the potential for verb-object combinations that can spuriously suggest metonymic relationships, we have found the algorithm as it stands to be too susceptible to jumping to false conclusions. We are looking to improve precision by increasing our repertoire of both positive and negative diagnostics, as well as by incorporating information theoretic statistics (as in Church and Hindle (1990)).</Paragraph>
    <Paragraph position="2"> Likewise, we have been investigating ways to reduce misses - cases in which evidence of relationships between terms known to be related is not detected by our current set of heuristics.</Paragraph>
    <Paragraph position="3"> One case in point regards our analysis of &amp;quot;disk&amp;quot;, which we initially expected to behave similarly to &amp;quot;tape&amp;quot; in its telic quale. However, the intersection of predicate sets for &amp;quot;disk&amp;quot; and &amp;quot;information&amp;quot; yielded the terms (copy specify set be have). Missing are &amp;quot;read&amp;quot; and &amp;quot;write&amp;quot;, the telic predicates for tape. This example reveals the subtleties present in the container metonymy.</Paragraph>
    <Paragraph position="4"> Specifically, the container can stand in for its contents only in those situations where one refers to the contents as a whole. While one typically &quot;reads&quot; an entire tape, one usually reads only parts of a disk at a time. &quot;Copying&quot; a whole disk is more typical, however, and hence shows up in our corpus. Reading and writing still apply to disks; however, since they do not apply holistically, we find instead constructions with the prepositions to and from, e.g. read/write from the disk.</Paragraph>
    <Paragraph position="5"> This example illustrates the pitfalls that are hiding if the linguistic rules are too coarsely defined, but it also shows that such rules are not domain specific, and thus, once properly formulated, could function in a general purpose diagnostic context. It remains an empirical question how well weak methods can be employed to discriminate among the qualia of a noun. While this constitutes the primary focus of our current research, we also believe that the above methods complement well other ongoing research in the construction of word-disambiguated dictionaries (e.g. \[m~,in 1990\]).</Paragraph>
  </Section>
</Paper>