<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1032">
<Title>AUTOMATIC ACQUISITION OF A LARGE SUBCATEGORIZATION DICTIONARY FROM CORPORA</Title>
<Section position="3" start_page="0" end_page="235" type="intro">
<SectionTitle> INTRODUCTION </SectionTitle>
<Paragraph position="0"> Rule-based parsers use subcategorization information to constrain the number of analyses that are generated. For example, from subcategorization alone, we can deduce that the PP in (1) must be an argument of the verb, not a noun phrase modifier: (1) John put [NP the cactus] [PP on the table]. Knowledge of subcategorization also aids text generation programs and people learning a foreign language.</Paragraph>
<Paragraph position="1"> A subcategorization frame is a statement of what types of syntactic arguments a verb (or adjective) takes, such as objects, infinitives, that-clauses, participial clauses, and subcategorized prepositional phrases. In general, verbs and adjectives each appear in only a small subset of all possible argument subcategorization frames.</Paragraph>
<Paragraph position="2"> Thanks to Julian Kupiec for providing the tagger on which this work depends and for helpful discussions and comments along the way. I am also indebted for comments on an earlier draft to Marti Hearst (whose comments were the most useful!), Hinrich Schütze, Penni Sibun, Mary Dalrymple, and others at Xerox PARC, where this research was completed during a summer internship; Stanley Peters, and the two anonymous ACL reviewers.</Paragraph>
<Paragraph position="3"> A major bottleneck in the production of high-coverage parsers is assembling lexical information, such as subcategorization information. In early and much continuing work in computational linguistics, this information has been coded laboriously by hand. More recently, on-line versions of dictionaries that provide subcategorization information have become available to researchers (Hornby 1989, Procter 1978, Sinclair 1987).
But this is the same method of obtaining subcategorizations - painstaking work by hand. We have simply passed the need for tools that acquire lexical information from the computational linguist to the lexicographer.</Paragraph>
<Paragraph position="4"> Thus there is a need for a program that can acquire a subcategorization dictionary from on-line corpora of unrestricted text: 1. Dictionaries with subcategorization information are unavailable for most languages (only a few recent dictionaries, generally targeted at non-native speakers, list subcategorization frames). 2. No dictionary lists verbs from specialized subfields (as in I telneted to Princeton), but these could be obtained automatically from texts such as computer manuals.</Paragraph>
<Paragraph position="5"> 3. Hand-coded lists are expensive to make, and invariably incomplete.</Paragraph>
<Paragraph position="6"> 4. A subcategorization dictionary obtained automatically from corpora can be updated quickly and easily as different usages develop. Dictionaries produced by hand always substantially lag real language use.</Paragraph>
<Paragraph position="7"> The last two points do not argue against the use of existing dictionaries, but show that the incomplete information that they provide needs to be supplemented with further knowledge that is best collected automatically. [1] The desire to combine hand-coded and automatically learned knowledge suggests that we should aim for a high precision learner (even at some cost in coverage), and that is the approach adopted here. [1] A point made by Church and Hanks (1989). Arbitrary gaps in listing can be smoothed with a program such as the work presented here. For example, among the 27 verbs that most commonly cooccurred with from, Church and Hanks found 7 for which this ...</Paragraph>
</Section>
</Paper>