<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1047"> <Title>The Acquisition of Lexical Semantic Knowledge from Large Corpora</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Machine-readable dictionaries provide the raw material from which to construct computationally useful representations of the generic vocabulary contained within them.</Paragraph> <Paragraph position="1"> Many sublanguages, however, are poorly represented in on-line dictionaries, if represented at all (cf. Grishman et al. (1986)). Yet vocabularies geared to specialized domains are necessary for many applications, such as text categorization and information retrieval. In this paper I describe research devoted to developing techniques for building sublanguage lexicons via syntactic and statistical corpus analysis coupled with analytic techniques based on the tenets of a generative theory of the lexicon (Pustejovsky 1991).</Paragraph> <Paragraph position="2"> In contrast to purely statistical collocational analyses, the framework of a lexical semantic theory allows the automatic construction of predictions about deeper semantic relationships among words appearing in collocational systems. I illustrate the approach with the acquisition of lexical information for several lexical classes, and show how such techniques can fine-tune the lexical structures acquired from an initial seeding of a machine-readable dictionary, i.e., the machine-tractable version of the LDOCE (Wilks et al. (1991)).</Paragraph> <Paragraph position="3"> The aim of our research is to discover what kinds of knowledge can be reliably acquired through the use of these methods, exploiting, as they do, general linguistic knowledge rather than domain knowledge. In this respect, our program is similar to that of Zernik (1989) and Zernik and Jacobs (1990), who work on extracting verb semantics from corpora using lexical categories. 
Our research, however, differs in two respects: first, we employ a more expressive lexical semantics; second, our focus is on all major categories in the language, not just verbs. This is important because, for full-text information retrieval, information about nominals is paramount, as most queries tend to be expressed as conjunctions of nouns. From a theoretical perspective, I believe that the contribution of the lexical semantics of nominals to the overall structure of the lexicon has been somewhat neglected, relative to that of verbs (cf. Pustejovsky and Anick (1988), Boguraev and Pustejovsky (1990)).</Paragraph> <Paragraph position="4"> Therefore, where others present ambiguity and metonymy as potential obstacles to effective corpus analysis, we believe that the existence of motivated metonymic structures actually provides valuable clues for semantic analysis of nouns in a corpus. To demonstrate these points, I describe experiments performed within the DIDEROT Tipster Extraction project (of Brandeis University and New Mexico State University), over a corpus of joint venture articles.</Paragraph> </Section> </Paper>