<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1502"> <Title>The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Preliminary Development of Matrix </SectionTitle> <Paragraph position="0"> We have produced a preliminary version of the grammar matrix relying heavily on the LinGO project's English Resource Grammar, and to a lesser extent on the Japanese grammar developed jointly by DFKI Saarbrücken (Germany) and YY Technologies (Mountain View, CA). This early version of the matrix comprises the following components: • Types defining the basic feature geometry and technical devices (e.g., for list manipulation).</Paragraph> <Paragraph position="1"> • Types associated with Minimal Recursion Semantics (see, e.g., Copestake, Lascarides, &amp; Flickinger, 2001), a meaning representation language which has been shown to be well-suited for semantic composition in typed feature structure grammars. This portion of the grammar matrix includes a hierarchy of relation types, types and constraints for the propagation of semantic information through the phrase structure tree, a representation of illocutionary force, and provisions for grammar rules which make semantic contributions.</Paragraph> <Paragraph position="2"> • General classes of rules, including derivational and inflectional (lexical) rules, unary and binary phrase structure rules, headed and non-headed rules, and head-initial and head-final rules. 
These rule classes include implementations of general principles of HPSG, such as the Head Feature and Non-Local Feature Principles.</Paragraph> <Paragraph position="3"> • Types for basic constructions such as head-complement, head-specifier, head-subject, head-filler, and head-modifier rules and coordination, as well as more specialized classes of constructions, such as relative clauses and noun-noun compounding. Unlike in specific grammars, these types do not impose any ordering on their daughters in the grammar matrix. Included with the matrix are configuration and parameter files for the LKB grammar engineering environment (Copestake, 2002).</Paragraph> <Paragraph position="4"> Although small, this preliminary version of the matrix already reflects the main goals of the project: (i) consistent with other work in HPSG, semantic representations and in particular the syntax-semantics interface are developed in detail; (ii) the types of the matrix are each representations of generalizations across linguistic objects and across languages; and (iii) the richness of the matrix and the incorporation of files which connect it with the LKB allow for extremely quick start-up as the matrix is applied to new languages.</Paragraph> <Paragraph position="5"> Since February 2002, this preliminary version of the matrix has been in use at two Norwegian universities, one working towards a broad-coverage reference implementation of Norwegian (NTNU), the other, for the time being, focused on specific aspects of clause structure and lexical description (Oslo University). 
In the first experiment with the matrix, at NTNU, basic Norwegian sentences were being parsed and assigned reasonable semantics within two hours of downloading the matrix files.</Paragraph> <Paragraph position="6"> Linguistic coverage should scale up quickly, since the foundation supplied by the matrix is designed not only to provide a quick start, but also to support long-term development of broad-coverage grammars. Both initiatives have confirmed the utility of the matrix starter kit and have already contributed to a series of discussions on cross-lingual HPSG design, specifically in the areas of argument structure representations in the lexicon and basic assumptions about constituent structure (in one view, Norwegian exhibits a VSO topology in the main clause). The user groups have suggested refinements and extensions of the basic inventory, and it is expected that general solutions, as they are identified jointly, will propagate into the existing grammars too.</Paragraph> </Section> <Section position="5" start_page="0" end_page="3" type="metho"> <SectionTitle> 3 A Detailed Example </SectionTitle> <Paragraph position="0"> As an example of the level of detail involved in the grammar matrix, in this section we consider the analysis of intersective and scopal modification. The matrix is built to produce Minimal Recursion Semantics (MRS; Copestake et al., 2001; Copestake, Flickinger, Sag, &amp; Pollard, 1999; Copestake, Flickinger, Malouf, Riehemann, &amp; Sag, 1995) representations. The two English examples in (1) exemplify the difference between intersective and scopal modification: (1) a. Keanu studied Kung Fu on a spaceship.</Paragraph> <Paragraph position="1"> b. Keanu probably studied Kung Fu.</Paragraph> <Paragraph position="2"> The MRSs for (1a-b) (abstracting away from agreement information) are given in (2) and (3). 
The MRSs are ordered tuples consisting of a top handle (h1 in both cases), an instance or event variable (e in both cases), a bag of elementary predications (eps), and a bag of scope constraints (in these cases, QEQ constraints or 'equal modulo quantifiers'). In a well-formed MRS, the handles can be identified in one or more ways respecting the scope constraints such that the dependencies between the eps form a tree. For a detailed description of MRS, see the works cited above. Here, we will focus on the difference between the intersective modifier on (a spaceship) and the scopal modifier probably.</Paragraph> <Paragraph position="3"> (Footnote: These examples also differ in that probably is a pre-head modifier while on a spaceship is a post-head modifier. This word-order distinction cross-cuts the semantic distinction, and our focus is on the latter, so we do not consider the word-order aspects of these examples here.)</Paragraph> <Paragraph position="4"> In (2), the ep contributed by on ('on-rel') shares its handle (h7) with the ep contributed by the verb it is modifying ('study-rel'). As such, the two will always have the same scope; no quantifier can intervene. Further, the second argument of the on-rel (e) is the event variable of the study-rel. The first argument, e′, is the event variable of the on-rel and the third argument, z, is the instance variable of the spaceship-rel.</Paragraph> <Paragraph position="6"> In (3), the ep contributed by the scopal modifier probably ('probably-rel') has its own handle (h7) which is not shared by anything. Furthermore, it takes a handle (h8) rather than the event variable of the study-rel as its argument. h8 is equal modulo quantifiers (QEQ) to the handle of the study-rel (h9), and the argument of the prpstn-rel (h2) is equal modulo quantifiers to h7. The prpstn-rel is the ep representing the illocutionary force of the whole expression. 
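As a rough illustration, the MRS in (3) can be transcribed into a small Python record. This is a sketch under stated assumptions and not part of the matrix itself: the names EP, MRS, and qeq_targets are invented for the example, and only the handles, relations, and QEQ pairs are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EP:
    """One elementary predication: its handle (label), relation name, and arguments."""
    handle: str
    rel: str
    args: tuple


@dataclass(frozen=True)
class MRS:
    """Hypothetical container mirroring the four-part tuple described in the text."""
    top: str      # top handle
    index: str    # instance or event variable
    eps: tuple    # bag of elementary predications
    qeqs: tuple   # pairs (hi, lo), read as "hi QEQ lo"


# The MRS in (3), transcribed from the text.
mrs3 = MRS(
    top="h1",
    index="e",
    eps=(
        EP("h1", "prpstn-rel", ("h2",)),
        EP("h3", "def-np-rel", ("x", "h4", "h5")),
        EP("h6", "named-rel", ("x", "Keanu")),
        EP("h7", "probably-rel", ("h8",)),
        EP("h9", "study-rel", ("e", "x", "y")),
        EP("h10", "def-np-rel", ("y", "h11", "h12")),
        EP("h13", "named-rel", ("y", "Kung Fu")),
    ),
    qeqs=(("h2", "h7"), ("h4", "h6"), ("h8", "h9"), ("h11", "h13")),
)


def qeq_targets(mrs, handle):
    """Relations whose label is QEQ to a handle argument of the ep labelled `handle`."""
    labels = {ep.handle: ep.rel for ep in mrs.eps}
    lo_of = dict(mrs.qeqs)
    ep = next(e for e in mrs.eps if e.handle == handle)
    return [labels[lo_of[arg]] for arg in ep.args if arg in lo_of]


print(qeq_targets(mrs3, "h7"))  # ['study-rel']
print(qeq_targets(mrs3, "h1"))  # ['probably-rel']
```

In this transcription the probably-rel never takes the label of the study-rel directly. Its argument h8 is only QEQ to h9, and the argument of the prpstn-rel is likewise only QEQ to h7, so a quantifier may be slotted in at either point. 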
This means that quantifiers associated with the NPs Keanu and Kung Fu can scope inside or outside probably.</Paragraph> <Paragraph position="7"> (3) ⟨ h1, e, { h1:prpstn-rel(h2), h3:def-np-rel(x, h4, h5), h6:named-rel(x, 'Keanu'), h7:probably-rel(h8), h9:study-rel(e, x, y), h10:def-np-rel(y, h11, h12), h13:named-rel(y, 'Kung Fu') }, { h2 QEQ h7, h4 QEQ h6, h8 QEQ h9, h11 QEQ h13 } ⟩ While the details of modifier placement, which parts of speech can modify which kinds of phrases, etc., differ across languages, we believe that all languages display a distinction between scopal and intersective modification. Accordingly, the types necessary for describing these two kinds of modification are included in the matrix.</Paragraph> <Paragraph position="8"> The types isect-mod-phrase and scopal-mod-phrase (shown in Figures 1 and 2) encode the information necessary to build up in a compositional manner the modifier portions of the MRSs in (2) and (3).</Paragraph> <Paragraph position="9"> These types are embedded in the type hierarchy of the matrix. Through their supertype head-mod-phr-simple they inherit information common to many types of phrases, including the basic feature geometry, head feature and non-local feature passing, and semantic compositionality. These types also have subtypes in the matrix specifying the two word-order possibilities (pre- or post-head modifiers), giving a total of four subtypes.</Paragraph> <Paragraph position="10"> The most important difference between these types is in the treatment of the handle of the head daughter's semantics, to distinguish intersective and scopal modification. In isect-mod-phrase, the top handles (TOP) of the head and non-head (i.e., modifier) daughters are identified (#hand). This allows for MRSs like (2) where the eps contributed by the head ('study-rel') and the modifier ('on-rel') take the same scope. The type scopal-mod-phrase bears no such constraint. 
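The contrast between the two phrase types can be sketched in Python. This is only an illustration under stated assumptions, not the matrix's own type definitions: the dictionary layout, the function names, the handle h20, and the variable e2 for the on-rel are all invented for the example, and QEQ constraints are left out.

```python
def rename(eps, old, new):
    """Replace every occurrence of handle `old` by `new` in a list of ep tuples."""
    return [tuple(new if part == old else part for part in ep) for ep in eps]


def isect_mod_phrase(head, mod):
    """Intersective modification: the daughters' top handles are identified."""
    shared = head["top"]
    mod_eps = rename(mod["eps"], mod["top"], shared)
    # The modifier's handle is now the head's handle, and also the phrase's.
    return {"top": shared, "eps": head["eps"] + mod_eps}


def scopal_mod_phrase(head, mod):
    """Scopal modification: no identification; the phrase takes the modifier's handle."""
    return {"top": mod["top"], "eps": head["eps"] + mod["eps"]}


# Each ep is (handle, relation, argument, ...); all values here are illustrative.
head = {"top": "h9", "eps": [("h9", "study-rel", "e", "x", "y")]}
on = {"top": "h20", "eps": [("h20", "on-rel", "e2", "e", "z")]}
probably = {"top": "h7", "eps": [("h7", "probably-rel", "h8")]}

isect = isect_mod_phrase(head, on)
scopal = scopal_mod_phrase(head, probably)

print([ep[0] for ep in isect["eps"]])   # ['h9', 'h9']
print([ep[0] for ep in scopal["eps"]])  # ['h9', 'h7']
```

In the intersective result both eps carry the same handle, so they always take the same scope. In the scopal result the head's handle h9 stays distinct from the modifier's handle h7. 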
This allows for MRSs like (3) where the modifier's semantic contribution ('probably-rel') takes the handle of the head's semantics ('study-rel') as its argument, so that the modifier outscopes the head. In both types of modifier phrase, a constraint inherited from the supertype ensures that the handle of the modifier is also the handle of the whole phrase. (Footnote: All four subtypes are provided on the theory that most languages will make use of all or most of them.)</Paragraph> <Paragraph position="11"> The constraints on the LOCAL value inside the modifier's MOD value regulate which lexical items can appear in which kind of phrase. Intersective modifiers specify lexically that they are [ MOD ⟨ [ LOCAL isect-mod ] ⟩ ] and scopal modifiers specify lexically that they are [ MOD ⟨ [ LOCAL scopal-mod ] ⟩ ].</Paragraph> <Paragraph position="13"> These constraints exemplify the kind of information that will be developed in the lexical hierarchy of the matrix. It is characteristic of broad-coverage grammars that every particular analysis interacts with many other analyses. Modularization is an ongoing concern, both for maintainability of individual grammars, and for providing the right level of abstraction in the matrix. For the same reasons, we have only been able to touch on the highlights of the semantic analysis of modification here, but hope that this quick tour will suffice to illustrate the extent of the jump-start the matrix can give in the development of new grammars.</Paragraph> </Section> <Section position="6" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Future Extensions </SectionTitle> <Paragraph position="0"> The initial version of the matrix, while sufficient to support some useful grammar work, will require substantial further development on several fronts, including lexical representation, syntactic generalization, sociolinguistic variation, processing issues, and evaluation. 
This first version drew most heavily from the implementation of the English grammar, with some further insights drawn from the grammar of Japanese. Extensions to the matrix will be based on careful study of existing implemented grammars for other languages, notably German, Spanish and Japanese, as well as feedback from those using the first version of the matrix. For lexical representation, one of the most urgent needs is to provide a language-independent type hierarchy for the lexicon, at least for major parts of speech, establishing the mechanisms used for linking syntactic subcategorization to semantic predicate-argument structure. Lexical rules provide a second mechanism for expressing generalizations within the lexicon, and offer ready opportunities for cross-linguistic abstractions for both inflectional and derivational regularities. Work is also progressing on establishing a standard relational database (using PostgreSQL) for storing information for the lexical entries themselves, improving both scalability and clarity compared to the current simple text file representation. Form-based tools will be provided both for constructing lexical entries and for viewing the contents of the lexicon.</Paragraph> <Paragraph position="1"> (Footnote: Note that there are no further subtypes of LOCAL values beyond isect-mod and scopal-mod. Since these grammars do not make extensive use of subtypes of LOCAL values, they were available for encoding this distinction. Alternative solutions include positing a new feature.)</Paragraph> <Paragraph position="2"> The primary focus of work on syntactic generalization in the matrix is to support more freedom in word order, for both complements and modifiers. 
The first step will be a relatively conservative extension along the lines of Netter (1996), allowing the grammar writer more control over how a head combines with complements of different types, and their interleaving with modifier phrases.</Paragraph> <Paragraph position="3"> Other areas of immediate cross-linguistic interest include the hierarchy of head types, control phenomena, clitics, auxiliary verbs, noun-noun compounds, and more generally, phenomena that involve the word/phrase distinction, such as noun incorporation. A study of the existing grammars for English, German, Japanese, and Spanish reveals a high degree of language-specificity for several of these phenomena, but also suggests promise of reusable abstractions.</Paragraph> <Paragraph position="4"> Several kinds of sociolinguistic variation require extensions to the matrix, including grammaticized aspects of pragmatics such as politeness and empathy, as well as dialect and register alternations. The grammar of Japanese provides a starting point for representations of both empathy and politeness.</Paragraph> <Paragraph position="5"> Implementations of familiar vs. formal verb forms in German and Spanish provide further instances of politeness to help build the cross-linguistic abstractions. Extensions for dialect variation will build on some exploratory work in adapting the English grammar to support American, British, and Australian regionalisms, both lexical and syntactic, while restricting dialect mixture in generation and associated spurious ambiguity in parsing. 
While the development of the matrix will be based largely on the LKB platform, support will also be needed for using the emerging grammars on other processing platforms, and for linking to other packages for pre-processing the linguistic input.</Paragraph> <Paragraph position="6"> Several other platforms exist which can efficiently parse text using the existing grammars, including the PET system developed in C++ at Saarland University (Germany) and the DFKI (Callmeier, 2000); the PAGE system developed in Lisp at the DFKI (Uszkoreit et al., 1994); the LiLFeS system developed at Tokyo University (Makino, Yoshida, Torisawa, &amp; Tsujii, 1998); and a parallel processing system developed in Objective C at Delft University (The Netherlands; van Lohuizen, 2002).</Paragraph> <Paragraph position="9"> As part of the matrix package, sample configuration files and documentation will be provided for at least some of these additional platforms.</Paragraph> <Paragraph position="10"> Existing pre-processing packages can also significantly reduce the effort required to develop a new grammar, particularly for coping with the morphology/syntax interface. For example, the ChaSen package for segmenting Japanese input into words and morphemes (Asahara &amp; Matsumoto, 2000) has been linked to at least the LKB and PET systems. Support for connecting implementations of language-specific pre-processing packages of this kind will be preserved and extended as the matrix develops. 
Likewise, configuration files are included to support generation, at least within the LKB, provided that the grammar conforms to certain assumptions about semantic representation using the Minimal Recursion Semantics framework.</Paragraph> <Paragraph position="11"> Finally, a methodology is under development for constructing and using test suites organized around a typology of linguistic phenomena, using the implementation platform of the [incr tsdb()] profiling package (Oepen &amp; Flickinger, 1998; Oepen &amp; Callmeier, 2000). These test suites will enable better communication about current coverage of a given grammar built using the matrix, and serve as the basis for identifying additional phenomena that need to be addressed cross-linguistically within the matrix. Of course, the development of the typology of phenomena is itself a major undertaking for which a systematic cross-linguistic approach will be needed, a discussion of which is outside the scope of this report. But the intent is to seed this classification scheme with a set of relatively coarse-grained phenomenon classes drawn from the existing grammars, then refine the typology as it is applied to these and new grammars built using the matrix.</Paragraph> </Section> <Section position="7" start_page="3" end_page="3" type="metho"> <SectionTitle> 5 Case Studies </SectionTitle> <Paragraph position="0"> One important part of the matrix package will be a library of phenomenon-based analyses drawn from the existing grammars and over time from users of the matrix, to provide working examples of how the matrix can be applied and extended. 
Each case study will be a set of grammar files, simplified for relevance, along with documentation of the analysis, and a test suite of sample sentences which define the range of data covered by the analysis.</Paragraph> <Paragraph position="1"> This library, too, will be organized around the typology of phenomena introduced above, but will also make explicit reference to language families, since both similarities and differences among related languages will be of interest in these case studies. Examples to be included in the first release of this library include numeral classifiers in Japanese, subject pro drop in Spanish, partial-VP fronting in German, and verb diathesis in Norwegian.</Paragraph> </Section> </Paper>