File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/93/j93-2005_abstr.xml
Size: 5,623 bytes
Last Modified: 2025-10-06 13:47:52
<?xml version="1.0" standalone="yes"?> <Paper uid="J93-2005"> <Title>Lexical Semantic Techniques for Corpus Analysis</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> The proliferation of on-line textual information poses an interesting challenge to linguistic researchers for several reasons. First, it provides the linguist with sentence and word usage information that has been difficult to collect and consequently largely ignored by linguists. Second, it has intensified the search for efficient automated indexing and retrieval techniques. Full-text indexing, in which all the content words in a document are used as keywords, is one of the most promising of recent automated approaches, yet its mediocre precision and recall characteristics indicate that there is much room for improvement (Croft 1989). The use of domain knowledge can enhance the effectiveness of a full-text system by providing related terms that can be used to broaden, narrow, or refocus a query at retrieval time (Debili, Fluhr, and Radasoa 1988; Anick et al. 1989). Likewise, domain knowledge may be applied at indexing time to do word sense disambiguation (Krovetz and Croft 1989) or content analysis (Jacobs 1991). Unfortunately, for many domains, such knowledge, even in the form of a thesaurus, is either not available or is incomplete with respect to the vocabulary of the texts indexed.</Paragraph> <Paragraph position="1"> * Computer Science Department, Brandeis University, Waltham MA 02254.</Paragraph> <Paragraph position="2"> † Computer Science Department, Concordia University, Montreal, Quebec H3G 1M8, Canada.
‡ Digital Equipment Corporation, 111 Locke Drive LM02-1/D12, Marlboro MA 01752.</Paragraph> <Paragraph position="3"> © 1993 Association for Computational Linguistics. Computational Linguistics, Volume 19, Number 2. In this paper we examine how linguistic phenomena such as metonymy and polysemy might be exploited for the semantic tagging of lexical items. Unlike purely statistical collocational analyses, employing a semantic theory allows for the automatic construction of deeper semantic relationships among words appearing in collocational systems. We illustrate the approach for the acquisition of lexical information for several classes of nominals, and show how such techniques can fine-tune the lexical structures acquired from an initial seeding of a machine-readable dictionary. In addition to conventional lexical semantic relations, we show how information concerning lexical presuppositions and preference relations (Wilks 1978) can also be acquired from corpora, when analyzed with the appropriate semantic tools. Finally, we discuss the potential that corpus studies have for enriching the data set for theoretical linguistic research, as well as helping to confirm or disconfirm linguistic hypotheses.</Paragraph> <Paragraph position="4"> The aim of our research is to discover what kinds of knowledge can be reliably acquired through the use of these methods, exploiting, as they do, general linguistic knowledge rather than domain knowledge. In this respect, our program is similar to Zernik's (1989) work on extracting verb semantics from corpora using lexical categories. Our research, however, differs in two respects: first, we employ a more expressive lexical semantics; second, our focus is on all major categories in the language, and not just verbs. This is important because, for full-text information retrieval, information about nominals is paramount, as most queries tend to be expressed as conjunctions of nouns.
From a theoretical perspective, we believe that the contribution of the lexical semantics of nominals to the overall structure of the lexicon has been somewhat neglected, relative to that of verbs. While Zernik (1989) presents ambiguity and metonymy as potential obstacles to effective corpus analysis, we believe that the existence of motivated metonymic structures actually provides valuable clues for semantic analysis of nouns in a corpus.</Paragraph> <Paragraph position="5"> We will assume, for this paper, the general framework of a generative lexicon as outlined in Pustejovsky (1991). In particular, we make use of the principles of type coercion and qualia structure. This model of semantic knowledge associated with words is based on a system of generative devices that is able to recursively define new word senses for lexical items in the language. These devices and the associated dictionary make up a generative lexicon, where semantic information is distributed throughout the lexicon to all categories. The general framework assumes four basic levels of semantic description: argument structure, qualia structure, lexical inheritance structure, and event structure.</Paragraph> <Paragraph position="6"> Connecting these different levels is a set of generative devices that provide for the compositional interpretation of words in context. The most important of these devices is a semantic transformation called type coercion--analogous to coercion in programming languages--which captures the semantic relatedness between syntactically distinct expressions. As an operation on types within a λ-calculus, type coercion can be seen as transforming a monomorphic language into one with polymorphic types (cf. Cardelli and Wegner 1985). When undergoing operations of semantic composition, argument, event, and qualia types must conform to the well-formedness conditions of the type system defined by the lexical inheritance structure.
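As an informal illustration of how type coercion can draw on qualia structure, the following Python sketch lets a verb that selects an event reinterpret a noun complement through one of the noun's qualia roles. It is a toy under stated assumptions, not the formalism of Pustejovsky (1991): the names Qualia, Noun, LEXICON, coerce_to_event, and interpret are hypothetical, the two lexical entries and their role values are invented for the example, and the type system is reduced to plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Qualia:
    """Four qualia roles; all values used here are illustrative assumptions."""
    formal: str                          # what kind of thing the noun denotes
    constitutive: Optional[str] = None   # what it is made of
    telic: Optional[str] = None          # its purpose, named as an event
    agentive: Optional[str] = None       # how it comes about, named as an event


@dataclass(frozen=True)
class Noun:
    lemma: str
    qualia: Qualia


# Hypothetical toy lexicon; these entries are not data from the paper.
LEXICON = {
    "book": Noun("book", Qualia(formal="physical_object",
                                constitutive="pages",
                                telic="read",
                                agentive="write")),
    "rock": Noun("rock", Qualia(formal="physical_object")),
}


def coerce_to_event(noun: Noun) -> Optional[str]:
    """Return an event reading for the noun, or None if no role supplies one.

    Assumption of this sketch: the telic role is preferred and the
    agentive role is the fallback.
    """
    return noun.qualia.telic or noun.qualia.agentive


def interpret(verb: str, noun_lemma: str) -> str:
    """Compose an event-selecting verb (e.g. 'begin') with a noun complement."""
    event = coerce_to_event(LEXICON[noun_lemma])
    if event is None:
        return f"{verb} {noun_lemma}: no coercion available"
    return f"{verb} to {event} the {noun_lemma}"


if __name__ == "__main__":
    print(interpret("begin", "book"))
    print(interpret("begin", "rock"))
```

Under these assumptions, interpret("begin", "book") yields "begin to read the book", whereas "rock", which has no telic or agentive value in the toy lexicon, admits no event reading.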
</Paragraph> </Section> </Paper>