File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/w02-1509_intro.xml
Size: 2,564 bytes
Last Modified: 2025-10-06 14:01:36
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1509"> <Title>Coping with problems in grammars automatically extracted from treebanks</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Much linguistic research is oriented toward finding general principles for natural language, classifying linguistic phenomena, building regular models (e.g., grammars) for the well-behaved (or well-understood) part of languages, and studying the remaining &quot;interesting&quot; problems in a compartmentalized way. With the availability of treebanks, large natural language corpora annotated for syntactic structure (e.g., Marcus et al., 1993), automatic grammar extraction became possible (Chen and Vijay-Shanker, 2000; Xia, 1999). Suddenly, grammars were being extracted with the aim of &quot;full&quot; coverage of the constructions in a language (to the extent, of course, that the corpora used represent the language), which immediately raises a question: if we do not know how to model many phenomena grammatically, how can it be that we are extracting such wide-coverage grammars?</Paragraph> <Paragraph position="1"> To answer that question we have to start a new thread at the edge of linguistics and computational linguistics. Beyond reporting numbers that express coverage, we have to start analyzing the quality of automatically generated grammars, identifying extraction problems, and uncovering whatever solutions are being given for them, however interesting or ugly they might be, thereby challenging the current paradigms of linguistic research to provide answers to these problems on a &quot;by-need&quot; basis.</Paragraph> <Paragraph position="2"> In this paper we report on a particular experience of automatically extracting an English grammar from the WSJ corpus of the Penn Treebank (PTB) (Marcus et al., 1994) using Tree Adjoining Grammars (TAGs; Joshi and Schabes, 1997).
We use an automatic tool developed by Xia (2001), suitably adapted to our particular needs, and focus on some problems we encountered in extracting a linguistically (and computationally) sound grammar and on the solutions we gave to them. The list of problems is a sample, far from exhaustive. Likewise, the solutions will not always be satisfactory.</Paragraph> <Paragraph position="3"> In Section 2 we introduce the method of grammar extraction employed. The problems are discussed in Section 3. We conclude in Section 4.</Paragraph> </Section> </Paper>