<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1507"> <Title>A Classification of Grammar Development Strategies</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction: Four grammar development strategies </SectionTitle> <Paragraph position="0"> There are several potential strategies for building wide-coverage grammars, and thus a need to classify them. In this paper, we propose a classification of grammar development strategies according to two criteria: - hand-crafted versus automatically acquired grammars; - grammars based on a low versus a high level of syntactic abstraction.</Paragraph> <Paragraph position="1"> As summarized in table 1, our classification yields four types of grammars, which we call types A, B, C and D.</Paragraph> <Paragraph position="2"> Of these four types, three have already been implemented to develop wide-coverage grammars for English within the Xtag project, and an implementation of the fourth type is underway 1. Most of our examples are based on the development of wide-coverage Tree Adjoining Grammars (TAG), but it is important to note that the classification is relevant within other linguistic frameworks as well (HPSG, GPSG, LFG, etc.) and is helpful for discussing portability among syntactic frameworks.</Paragraph> <Paragraph position="3"> We devote a section to each type of grammar in our classification. We discuss the advantages and drawbacks of each approach, and especially focus 1We do not discuss shallow-parsing approaches here, but only full grammar development. Due to space limitations, we do not introduce the TAG formalism and refer to (Joshi, 1987) for an introduction.</Paragraph> <Paragraph position="4"> on how each type performs w.r.t. grammar coverage, linguistic adequacy, maintenance, over- and under-generation, as well as portability to other syntactic frameworks. 
We discuss grammar replication as a means of comparing these approaches. Finally, we argue that the fourth type, which is currently being implemented, exhibits better development properties.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 TYPE A Grammars: hand-crafted </SectionTitle> <Paragraph position="0"> The limitations of Type A grammars (hand-crafted) are well known: although linguistically motivated, developing and maintaining a totally hand-crafted grammar is a challenging (perhaps unrealistic?) task. Such a large hand-crafted grammar for TAGs is described for English in (XTAG Research Group, 2001). Smaller hand-crafted grammars for TAGs have been developed for other languages (e.g. French (Abeille, 1991)), with similar problems. Of course, the limitations of hand-crafted grammars are not specific to the TAG framework (see e.g. (Clement and Kinyon, 2001) for LFG).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Coverage issues </SectionTitle> <Paragraph position="0"> The Xtag grammar for English, which is freely downloadable from the project homepage 2 (along with tools such as a parser and extensive documentation), has been under constant development for approximately 15 years. It consists of more than 1200 elementary trees (1000 for verbs) and has been tested on real text and test suites. For instance, (Doran et al., 1994) report that 61% of 1367 grammatical sentences from the TSNLP test suite (Lehmann et al., 1996) were parsed with an early version of the grammar. More recently, (Prasad and Sarkar, 2000) evaluated the coverage of the grammar on &quot;the weather corpus&quot;, which contained rather complex sentences with an average length of 20 words, as well as on the &quot;CSLI LKB test suite&quot; (Copestake, 1999). 
In addition, an internal test suite, which contains all the example sentences (grammatical and ungrammatical) from the continually updated documentation of the grammar, is distributed with the grammar in order to evaluate the range of syntactic phenomena the grammar covers. (Prasad and Sarkar, 2000) argue that constant evaluation is useful not only to get an idea of the coverage of a grammar, but also as a way to continuously improve and enrich the grammar 3.</Paragraph> <Paragraph position="1"> Parsing failures were due, among other things, to POS errors, missing lexical items, missing trees (i.e. grammar rules), feature clashes, and bad lexicon-grammar interaction (e.g. a lexical item anchoring the wrong tree(s)).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Maintenance issues </SectionTitle> <Paragraph position="0"> As a hand-crafted grammar grows, consistency issues arise and one then needs to develop maintenance tools. (Sarkar and Wintner, 1999) describe such a maintenance tool for the Xtag grammar for English, which aims at identifying problems such as typographical errors (e.g. a typo in a feature can prevent unification at parse time and hurt performance), undocumented features (features from older versions of the grammar that no longer exist), type errors (e.g. English verb nodes should not be assigned a gender feature), etc. But even with such maintenance tools, coverage, consistency and maintenance issues still remain.</Paragraph> <Paragraph position="1"> Some sentences in the weather corpus failed to parse because this corpus contained frequent free relative constructions not handled by the grammar. After augmenting the grammar, 89.6% of the sentences did get a parse.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Are hand-crafted grammars useful? 
</SectionTitle> <Paragraph position="0"> Some degree of automation in grammar development is unavoidable for any real-world application: small and even medium-size hand-crafted grammars are not useful for practical applications because of their limited coverage, while larger grammars give rise to maintenance issues. However, despite the problems of coverage and maintenance encountered with hand-crafted grammars, such experiments are invaluable from a linguistic point of view. In particular, the Xtag grammar for English comes with very detailed documentation, which has proved extremely helpful for devising increasingly automated approaches to grammar development (see sections below) 4.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 TYPE B Grammars: Automatically extracted </SectionTitle> <Paragraph position="0"> To remedy some of these problems, Type B grammars (i.e. automatically acquired, mostly from annotated corpora) have been developed. For instance, (Chiang, 2000), (Xia, 2001) and (Chen, 2001) all automatically acquire large TAGs for English from the Penn Treebank (Marcus et al., 1993). However, despite an improvement in coverage, new problems arise with this type of grammar: availability of annotated data large enough to avoid sparse-data problems, possible lack of linguistic adequacy, extraction of potentially unreasonably large grammars (which slows down parsing and increases ambiguity), 4Perhaps fully hand-crafted grammars can be used in practice on limited domains, e.g. the weather corpus. However, a degree of automation is useful even in those cases, if only to ensure consistency and avoid some maintenance problems.</Paragraph> <Paragraph position="1"> and lack of domain and framework independence (e.g. 
a grammar extracted from the Penn Treebank will reflect the linguistic choices and the annotation errors made when annotating the treebank).</Paragraph> <Paragraph position="2"> We give two examples of problems encountered when automatically extracting TAG grammars: the extraction of a wrong domain of locality, and the problem of sparse data in the integration of the lexicon with the grammar.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Wrong domain of locality </SectionTitle> <Paragraph position="0"> Long-distance dependencies are difficult to detect accurately in annotated corpora, even when such dependencies can be adequately modeled by the grammar framework used for extraction (which is the case for TAGs, but not, for instance, for Context-Free Grammars). For example, (Xia, 2001) extracts two elementary trees from a sentence such as &quot;Which dog does Hillary Clinton think that Chelsea prefers&quot;.</Paragraph> <Paragraph position="1"> These trees are shown in figure 1. Unfortunately, because of the potentially unbounded dependency, the two trees exhibit an incorrect domain of locality: the Wh-extracted element ends up in the wrong elementary tree, as an argument of &quot;think&quot; instead of as an argument of &quot;prefer&quot; 5</Paragraph> <Paragraph position="3"> This problem is not specific to TAGs, and would translate in other frameworks into the extraction of the &quot;wrong&quot; dependency structure 6.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Sparse data for lexicon-grammar integration </SectionTitle> <Paragraph position="0"> Existing extraction algorithms for TAGs acquire a fully lexicalized grammar. A TAG grammar may be seen as consisting of, on the one 6One might argue that such problems could be avoided by using simple CFGs, and/or that this problem is only of interest to linguists. A counter-argument is that the linguistic adequacy of a grammar, whether extracted or not, DOES matter. 
An extreme caricature to illustrate this point: the context-free grammar S → S word | word allows one to robustly and unambiguously parse any text, but is not very useful for any further NLP. hand, &quot;tree templates&quot; and, on the other hand, a lexicon which indicates which tree template(s) should be associated with each lexical item 7.</Paragraph> <Paragraph position="1"> Suppose the following three sentences are encountered in the training data: 1. Peter watches the stars 2. Mary eats the apple 3. What does Peter watch? From these three sentences, two tree templates will be correctly acquired, as shown in figure 2: the first covers the canonical order of realization of arguments for sentences 1 and 2, the second covers the case of a Wh-extracted object for sentence 3. Concerning the interaction between the lexicon and the grammar rules, the fact that &quot;watch&quot; should select both trees will be accurately detected. However, the fact that &quot;eat&quot; should also select both trees will be missed, since &quot;eat&quot; has not been encountered in a Wh-extracted-object construction. (Figure 2: the two tree templates, anchored by eat and watch, and the lexicon-grammar interface.)</Paragraph> <Paragraph position="3"> A level of syntactic abstraction is missing: in this case, the notion of subcategorization frame. 
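This sparse-data pattern can be made concrete with a small sketch (toy code, not from any of the cited extraction systems; the frame and realization labels are invented for illustration). A fully lexicalized Type B extractor records only the (verb, tree template) pairs it has actually seen, so eat never selects the Wh-extracted-object tree; routing verbs through a shared subcategorization frame recovers it:

```python
# Toy illustration: why fully lexicalized extraction under-generates,
# and how a subcategorization-frame abstraction fills the gap.

observations = [            # (verb, subcat frame, argument realization)
    ("watch", "transitive", "canonical"),      # "Peter watches the stars"
    ("eat",   "transitive", "canonical"),      # "Mary eats the apple"
    ("watch", "transitive", "wh-object"),      # "What does Peter watch?"
]

# Type B: each verb only selects the tree templates it was seen with.
lexicalized = {}
for verb, frame, realization in observations:
    lexicalized.setdefault(verb, set()).add((frame, realization))

# Abstracted: verbs map to frames; frames pool all attested realizations.
frames_of = {}
realizations_of = {}
for verb, frame, realization in observations:
    frames_of.setdefault(verb, set()).add(frame)
    realizations_of.setdefault(frame, set()).add(realization)

def predicted(verb):
    """Trees licensed once realizations are shared through the frame."""
    return {(f, r) for f in frames_of[verb] for r in realizations_of[f]}

print(lexicalized["eat"])   # misses the wh-object tree
print(predicted("eat"))     # recovers it via the shared transitive frame
```

Here `predicted("eat")` licenses the Wh-extracted-object tree because watch, another transitive verb, attests that realization for the transitive frame — exactly the generalization that grouping trees into families provides.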
This is especially noticeable within the TAG framework: in a hand-crafted TAG, grammar rules are grouped into &quot;tree families&quot;, with one family for each subcategorization frame (transitive, intransitive, ditransitive, etc.), whereas automatically extracted TAGs do not currently group trees into families.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 TYPE C Grammars </SectionTitle> <Paragraph position="0"> To remedy the lack of coverage and the maintenance problems linked to hand-crafted grammars, as well as the lack of generalization and linguistic adequacy of automatically extracted grammars, new syntactic levels of abstraction have been defined. In the context of TAGs, one can cite the notion of MetaRules (Becker, 2000), (Prolo, 2002) 8, and the notion of MetaGrammar (Candito, 1996); for related approaches in other formalisms, see (Carroll et al., 2000) or (Evans et al., 2000)</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 MetaRules </SectionTitle> <Paragraph position="0"> A MetaRule works as a pattern-matching tool on trees. It takes as input an elementary tree and outputs a new, generally more complex, elementary tree. Therefore, in order to create a TAG, one can start from one canonical elementary tree for each subcategorization frame and a finite number of MetaRules which model syntactic transformations (e.g. passive, wh-questions, etc.), and automatically generate a full-size grammar. (Prolo, 2002) started from 57 elementary trees and 21 hand-crafted MetaRules, and re-generated the verb trees of the hand-crafted Xtag grammar for English described in the previous section.</Paragraph> <Paragraph position="1"> The replication of the hand-crafted grammar for English using a MetaRule tool presents interesting aspects: it allows a direct comparison of the two approaches. Some trees generated by (Prolo, 2002) were not in the hand-crafted grammar (e.g. 
various orderings of &quot;by-phrase passives&quot;), while some others that were in the hand-crafted grammar were not generated by the MetaRules 9. This replication process makes it possible, with detailed scrutiny of the results, to: - identify what should be considered under- or over-generation of the MetaRule tool;</Paragraph> <Paragraph position="2"> - identify what should be considered under- or over-generation of the hand-crafted grammar.</Paragraph> <Paragraph position="3"> Thus, grammar replication tasks make it possible to improve both the hand-crafted and the MetaRule-generated grammars.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 MetaGrammars </SectionTitle> <Paragraph position="0"> Another possible approach for compact and abstract grammar encoding is the MetaGrammar (MG), initially developed by (Candito, 1996). The idea is to compact linguistic information thanks to an additional layer of linguistic description, which imposes a general organization for syntactic information in a three-dimensional hierarchy of classes. Each terminal class in dimension 1 describes a possible initial subcategorization (i.e. a TAG tree family). Each terminal class in dimension 2 describes a list of ordered redistributions of functions (e.g. it allows one to add an argument for causatives, 9Due to space limitations, we refer to (Prolo, 2002) for a detailed discussion.</Paragraph> <Paragraph position="1"> or to erase one for agentless passives). Each terminal class in dimension 3 represents the surface realization of a surface function (e.g. it declares whether a direct object is pronominalized, wh-extracted, etc.). Each class in the hierarchy corresponds to the partial description of a tree (Rogers and Vijay-Shanker, 1994). 
A TAG elementary tree is generated by inheriting from exactly one terminal class from dimension 1, one terminal class from dimension 2, and n terminal classes from dimension 3 (where n is the number of arguments of the elementary tree being generated). For instance, the elementary tree for &quot;Par qui sera accompagnee Marie&quot; (By whom will Mary be accompanied) is generated by inheriting from transitive in dimension 1, from impersonal-passive in dimension 2, and from subject-nominal-inverted for its subject and questioned-object for its object in dimension 3. This compact representation allows one to generate a 5000-tree grammar from a hand-crafted hierarchy of a few dozen nodes, especially since nodes are explicitly defined only for simple syntactic phenomena 10. The MG was used to develop a wide-coverage grammar for French (Abeille et al., 1999). It was also used to develop a medium-size grammar for Italian, as well as a generation grammar for German (Gerdes, 2002) using the newly available implementation described in (Gaiffe et al., 2002). A similar MetaGrammar approach has been described in (Xia, 2001) for English 11.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 MetaGrammars versus MetaRules: which is best? </SectionTitle> <Paragraph position="0"> It would be desirable to have a way of comparing the results of the MetaGrammar approach with those of the MetaRule approach. Unfortunately, this is not yet possible, because the two approaches have so far not been used within the same project. Therefore, in order to allow a better comparison between these two approaches, we have started a second replication of the Xtag grammar for English, this time using a MG. 
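The inheritance scheme just described can be sketched in a few lines (hypothetical code with invented class names; real MG hierarchies are far richer): one class is drawn from dimension 1, one from dimension 2, and one dimension-3 class per remaining argument, and the cross-product yields the elementary trees.

```python
from itertools import product

# Toy inventories for the three dimensions (illustrative names only).
dim3 = {"subject": ["canonical", "wh-extracted"],
        "object":  ["canonical", "wh-extracted"]}   # per-argument realization

# Which argument functions survive each dimension-2 redistribution;
# here the agentless passive simply erases the object argument.
ARGS = {"transitive": {"active": ["subject", "object"],
                       "agentless-passive": ["subject"]}}

def generate(subcat):
    """Cross one dim-1 class, one dim-2 class, and n dim-3 classes."""
    trees = []
    for redistribution, funcs in ARGS[subcat].items():
        for combo in product(*(dim3[f] for f in funcs)):
            trees.append((subcat, redistribution) + combo)
    return trees

trees = generate("transitive")
print(len(trees))  # 4 active trees (2x2 realizations) + 2 passive trees
```

Crossing the two possible realizations of each of the two active-transitive arguments gives four trees, plus two for the agentless passive whose object was erased by the dimension-2 class — a miniature of how a few dozen hand-written nodes can expand into thousands of trees.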
This replication should allow us to make a direct comparison between the hand-crafted grammar, the grammar generated with MetaRules, and the grammar generated with a MG.</Paragraph> <Paragraph position="1"> For this replication task, we use the more recent implementation presented in (Gaiffe et al., 2002), because in their tool each tree in the grammar is associated with a feature structure which describes its salient linguistic properties 14.</Paragraph> <Paragraph position="2"> In the (Gaiffe et al., 2002) implementation, each class in the MG hierarchy can specify: - tree descriptions relating nodes (dominates, precedes, equals) - traditional feature equations for agreement. The MG tool automatically crosses the nodes in the hierarchy, looking to create &quot;balanced&quot; classes, that is, classes that neither need nor provide anything. From these balanced terminal classes, elementary trees are generated. Figure 3 shows how a canonical transitive tree is automatically generated from 3 hand-written classes and the quasi-trees associated with these classes 15.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Advantages and drawbacks of TYPE C grammars </SectionTitle> <Paragraph position="0"> It is often assumed that MetaRule and MetaGrammar approaches exhibit some of the advantages of hand-crafted grammars (linguistic relevance) and some of the advantages of automatically extracted grammars (wide coverage), along with easier maintenance and better coherence. 
However, as is pointed out in (Barrier et al., 2000), grammar development based on hand-crafted levels of abstraction gives rise to new problems while not necessarily solving all the old ones. Although the automatic generation of the grammar ensures some level of consistency, 15A MetaGrammar of 3 classes: one symbol stands for &quot;father of&quot;, one for &quot;precedes&quot;, one for anchor nodes and one for substitution nodes.</Paragraph> <Paragraph position="1"> problems arise if mistakes are made while hand-crafting the abstract level (hierarchy or MetaRules) from which the grammar is automatically generated. This problem is actually more serious than with simple hand-crafted grammars, since an error in one node will affect ALL trees that inherit from this node. Furthermore, a large portion of the generated grammar covers rare syntactic phenomena that are not encountered in practice, which unnecessarily augments the size of the resulting grammars and increases ambiguity, while not significantly improving coverage 16. One crucial problem is that despite the automatic generation of the grammar (which eases maintenance), the interface between lexicon and grammar is still mainly manually maintained (and of course one of the major sources of parsing failures is missing or erroneous lexical entries). 16For instance, the 5000-tree grammar for French parses 80% of (simple) TSNLP sentences, and does not parse newspaper text, whereas the 1200-tree hand-crafted Xtag grammar for English does. Basically, instead of solving both under-generation and over-generation problems, a hand-crafted abstract level of syntactic encoding runs the risk of increasing both.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 TYPE D Grammars </SectionTitle> <Paragraph position="0"> However, the main potential advantage of such an abstract level of syntactic representation is framework independence. 
We argue that the main drawbacks of an abstract level of syntactic representation (over-generation, propagation of manual errors to generated trees, interface with the lexicon) may be solved if this abstract level is acquired automatically instead of being hand-crafted. Other problems, such as sparse-data problems, are also handled by such a level of abstraction 17. This corresponds to type D in our classification. A preliminary description of this work, which consists in automatically extracting the hierarchy nodes of a MetaGrammar (i.e. a high level of syntactic abstraction) from the Penn Treebank, may be found in (Kinyon and Prolo, 2002). The underlying idea is that a lot of abstract, framework-independent syntactic information is implicitly present in the treebank, and has to be retrieved. This includes: subcategorization information, potential valency alternations (e.g. passives are detected by a morphological marker on the POS of the verb, by the presence of an NP-object &quot;trace&quot;, and possibly by the presence of a prepositional phrase introduced by &quot;by&quot; and marked as &quot;logical-subject&quot;), and realization of arguments (e.g. Wh-extractions are detected by the presence of a Wh constituent co-indexed with a trace). In order to retrieve this information, we have examined all the possible tag combinations of the Penn Treebank 2 annotation style, and have determined for each combination, depending on its location in the annotated tree, whether it was an argument (optional or compulsory) or a modifier. We mapped each argument to a syntactic function 18. This allowed us to extract fine-grained subcategorization frames for each verb in the treebank. 
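The extraction step can be sketched as follows (a deliberately simplified, hypothetical fragment: it parses a toy Penn-Treebank-style bracketing and maps the verb's sister constituents to syntactic functions, treating adverbial tags such as NP-TMP as modifiers; the real program also handles traces, coordination, and the full function-tag inventory):

```python
import re

def parse(s):
    """Parse a Penn-Treebank-style bracketed string into nested tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def walk(i):
        label = tokens[i + 1]          # tokens[i] is "("
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)
                children.append(child)
            else:
                children.append(tokens[i])   # leaf word
                i += 1
        return (label, children), i + 1
    tree, _ = walk(0)
    return tree

# Toy mapping from constituent tags to functions; tags not listed here
# (NP-TMP, ADVP, ...) are treated as modifiers and ignored.
FUNCTION = {"NP": "direct-object", "NP-PRD": "predicative"}

def subcat(vp):
    """Read a verb's frame off its VP: sisters mapped to functions."""
    _, children = vp
    frame = ["subject"]                      # assume an external subject
    for child in children:
        if isinstance(child, tuple) and child[0] in FUNCTION:
            frame.append(FUNCTION[child[0]])
    return frame

vp = parse("(VP (VBZ eats) (NP (DT the) (NN apple)) (NP-TMP (NN today)))")
print(subcat(vp))  # NP-TMP is skipped as a modifier
```

Running this on the toy VP yields the transitive frame [subject, direct-object], while the temporal NP is correctly discarded as a modifier.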
Each subcategorization frame is stored as a finite number of terminal classes using the (Gaiffe et al., 2002) MG tool: one class for each subcategorization frame (dimension 1 in Candito's terminology), and one class for 17As discussed in section 3, if one sees eat in the data, and one sees some other transitive verb with a Wh-extracted object, the elementary tree for &quot;What does J. eat&quot; is correctly generated, even if eat has never been encountered in such a construction in the data, which is not the case with the automatic extraction of traditional lexicalized grammars. 18We use the following functions: subject, predicative, direct object, second object, indirect object, LocDir object. each function realization (dimension 3 in Candito's terminology). The same technique is used to acquire the valency alternations for each verb, and non-canonical syntactic realizations of verb arguments (Wh-extractions, etc.). This amounts to extracting &quot;hypertags&quot; (Kinyon, 2000) from the treebank, transforming these hypertags into a MetaGrammar, and automatically generating a TAG from the MG.</Paragraph> <Paragraph position="1"> An example of extraction may be seen in figure 4: expose appears here in a reduced-relative construction. However, from the trace occupying the canonical position of a direct object, the program retrieves the correct subcategorization frame (i.e. tree family) for this verb. Hence, this single occurrence of expose suffices to extract the MG nodes from which both the &quot;canonical&quot; tree and the &quot;reduced relative&quot; tree will be generated. If one were extracting a simple type B grammar, the canonical tree would not be retrieved from this example.</Paragraph> <Paragraph position="2"> This work is still underway 19. From the abstract level of syntactic generalization, a TAG will be automatically generated. 
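The hypertag-to-MetaGrammar step can be illustrated with a toy sketch (the dict layout below is invented for illustration and does not follow the actual hypertag encoding of (Kinyon, 2000)): from the single reduced-relative occurrence of expose, MG classes for both the canonical realization and the observed non-canonical one are produced.

```python
# One observed reduced-relative occurrence of "expose", with its
# direct object recovered from the trace (toy representation).
hypertag = {"verb": "expose",
            "frame": "transitive",
            "observed": {"object": "reduced-relative"}}

FRAME_ARGS = {"transitive": ("subject", "object")}

def mg_classes(ht):
    """MG classes licensed by one occurrence: every argument of the
    frame gets its canonical realization, plus whatever was observed."""
    classes = {(ht["frame"], arg, "canonical")
               for arg in FRAME_ARGS[ht["frame"]]}
    classes |= {(ht["frame"], arg, real)
                for arg, real in ht["observed"].items()}
    return classes

for cls in sorted(mg_classes(hypertag)):
    print(cls)
```

Both the canonical-object class and the reduced-relative class come out of this single token, whereas a plain Type B extractor would only acquire the reduced-relative tree actually seen in the data.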
It is interesting to note that the resulting grammar does not have to closely reflect the linguistic choices of the annotated data from which it was extracted (contrary to type B grammars). Moreover, from the same abstract syntactic data, one could also generate a grammar in another framework (e.g. LFG). Hence, this abstract 19For now, this project has already yielded, as a byproduct, a freely available program for extracting verb subcategorization frames (with syntactic functions) from the Penn Treebank. level may be viewed as a syntactic interlingua which can solve some portability issues 20.</Paragraph> </Section> </Paper>