<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1501"> <Title>Some apparently disjoint aims and requirements for grammar development environments: the case of natural language generation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The lack of use of analysis-based GDE's for generation </SectionTitle> <Paragraph position="0"> There is clearly a partially 'sociological' explanation for the lack of exchange between approaches in analysis and generation: the groups working on analysis and text generation are by and large disjoint, and the questions and issues central to these groups thus overlap at best only partially. This is not, however, a sufficient explanation. Much analysis-oriented work (e.g., (Pulman, 1991)) has attempted to achieve a workable level of generality and formal well-foundedness that would guarantee the widespread applicability and re-usability of its results. If this were sufficient and had been successful, one could expect generation developers to have availed themselves of these results. But uptake for generation continues to be restricted to those working in the analysis-oriented tradition, mostly in the pursuit of 'bi-directional' sentence generation on the basis of resources developed primarily for analysis.</Paragraph> <Paragraph position="1"> 'Core' text generation activities remain untouched.</Paragraph> <Paragraph position="2"> One, more contentful, reason for this is that the particular requirements of generation favour an organization of linguistic resources that has itself proved supportive of powerful development and generation environments. To clarify the needs of generation and their relation to the GDE's adopted, we can cross-classify approaches adopted for generation according to the kind of generation targeted. This largely corresponds to the size of linguistic unit generated. Thus we can usefully distinguish generation of single phrases, generation of single sentences or utterances (as still occurs most typically in MT or in utterance generation in dialogue systems), generation of connected texts of a single selected text type, and generation of connected texts of a variety of text types (e.g., showing variation for levels of user expertise, etc.). These are distinguished precisely because it is well known from generation work that different issues play a role for these differing functionalities.</Paragraph> <Paragraph position="3"> Three generation 'environments' cover the majority of projects concerned with text generation where generation for some practical purpose(s) is the main aim, rather than the development of some particular linguistic treatment or pure research into problems of generation or NLP generally. These are Elhadad's (Elhadad, 1990) 'Functional Unification Formalism' (FUF), the KPML/Penman systems (Mann and Matthiessen, 1985; Bateman, 1997), and approaches within the Meaning-Text Model (cf. (Mel'čuk and Zholkovskij, 1970)) as used in the CoGenTex family of generators. Here, resources appropriate for real generation are accordingly understood as broad-coverage (with respect to a target application or set of applications) linguistic descriptions of languages that provide mappings from enriched semantic specifications (including details of communicative effects and textual organization) to corresponding surface strings in close to real time. 
In addition, there are many systems that adopt, in contrast, a template-based approach to generation--now often combined with full generation in so-called 'hybrid' frameworks. Finally, there is a very small number of serious, large-scale and/or practical projects where analysis-derived grammatical resources are adopted. This distribution is summarized in Table 1. Importantly, it is only for the approaches in the final righthand column that standard analysis-based GDE's appear to be preferred or even applicable. (For references to individual systems see the Web or a detailed current state of the art such as Zock and Adorni (Zock and Adorni, 1996) or Bateman (Bateman, to appear).)</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Communicative function: a common thread in generation resources </SectionTitle> <Paragraph position="0"> It is well known in natural language generation (NLG) that functional information concerning the communicative intent of some utterance provides a convenient and necessary organization for generator decisions (cf. (McDonald, 1980; Appelt, 1985; McKeown, 1985)). Different approaches rely on communicative function to a greater or lesser degree.</Paragraph> <Paragraph position="1"> Some subordinate it entirely to structure, some attempt to combine structure and function felicitously, others place communicative function clearly in the foreground. Among these latter, approaches based on systemic-functional linguistics (SFL) have found the widest application. Both the FUF and KPML/Penman environments draw heavily on SFL.</Paragraph> <Paragraph position="2"> That is, these environments emphasize the paradigmatic organization of resources in contrast to their syntagmatic, structural organization. It turns out that it is this crucial distinction that provides the cleanest account of the difference between a GDE such as ALEP and one such as KPML.</Paragraph> <Paragraph position="3"> Viewed formally, a paradigmatic description of grammar such as that of SFL attempts to place as much of the work of the description as possible in the type lattice constituting the grammar. The role of constraints over possible feature structures is minimal. Moreover, the distinctions represented in the type lattice subsume all kinds of grammatical variation--including variations that, in an HPSG-style account for example, might be considered examples of the application of lexical rules. Diathesis alternations are one clear example; differing focusing constructions are another. These are all folded into the type lattice. Generation with such a resource is then reduced to traversing the type lattice, generally from least-specific to most-specific types, collecting constraints on structure. A grammatical unit is then exhaustively described by the complete list of types selected during a traversal: this is called a selection expression. Additional mechanisms (in particular, the 'choosers') serve to enforce determinism: that is, rather than collecting parallel compatible partial selection expressions, deterministic generation is enforced by appealing to semantic or lexical information as and when required. This approach, which is theoretically less than ideal, in fact supports quite efficient generation. 
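To make the traversal concrete, the following is a minimal Python sketch of the lattice-plus-chooser scheme just described; all names (Disjunction, chooser, traverse, and so on) are invented for illustration and do not correspond to KPML's actual implementation.

# Minimal, illustrative sketch of deterministic traversal of a paradigmatic
# type lattice; an assumption-laden example, not KPML's implementation.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

@dataclass
class Disjunction:
    """One choice point ('system') in the paradigmatic type lattice."""
    name: str
    entry_condition: Callable[[Set[str]], bool]  # test over features already selected
    options: List[str]                           # grammatical features offered here
    chooser: Callable[[dict], str]               # consults the semantic spec, returns one option
    realizations: Dict[str, List[str]] = field(default_factory=dict)  # option -> structural constraints

def traverse(lattice: List[Disjunction], semantics: dict):
    """Walk the lattice from least- to most-specific disjunctions, letting each
    chooser pick exactly one feature (no backtracking), accumulating both the
    selection expression and the structural constraints it licenses."""
    selection_expression: Set[str] = set()
    constraints: List[str] = []
    for system in lattice:                       # lattice assumed topologically ordered
        if not system.entry_condition(selection_expression):
            continue                             # this disjunction is not (yet) applicable
        feature = system.chooser(semantics)      # deterministic semantic choice
        selection_expression.add(feature)
        constraints.extend(system.realizations.get(feature, []))
    return selection_expression, constraints

The selection expression produced by such a traversal is exactly the object that the debugging facilities described below operate on.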
Such an approach can be equated with the use of 'lean' formalisms in analysis-oriented GDE's.</Paragraph> <Paragraph position="4"> The paradigmatic design sketched here has proved to have significant consequences for the design of appropriate development environments. The properties of these development environments are also directly inferable from the properties of the linguistic descriptions they are to support. Among the results are:
* a much improved mode of resource debugging,
* a powerful treatment of multilinguality in linguistic resources,
* and strong support for distributed large-scale grammar development.</Paragraph>
<Paragraph position="5"> We will briefly note these features and then present some derivative functionalities that also represent differences between analysis- and generation-oriented GDE's. For the functional approaches, our concrete descriptions will be based on KPML: FUF is not explicitly multilingual and has as yet few visualization tools for resource development (limited, for example, to basic graphs of the type lattice).</Paragraph> <Paragraph position="8"> KPML is more similar in its stage of development to, for example, ALEP, in that it offers a range of visualisation techniques for both the static resources and their dynamic use during generation, as well as support methods for resource construction, including versioning, resource merging, and distinct kinds of modularity for distributed development. FUF is still mostly concerned with the underlying engine for generation and represents a programming environment analogous to CUF or TDL.</Paragraph> <Paragraph position="9"> Beyond interactive tracing

Experiences with debugging and maintaining large generation grammars lead to the conclusion that 'tracing' or 'stepping' during execution of the resources is usually not a useful way to proceed. This was the favored (i.e., only) mode of interaction with, for example, the Penman system in the 80s. This has been refined subsequently, both in Penman and in KPML and FUF, so that particular tracing can occur, interactively if required, only when selected linguistic objects (e.g., particular disjunctions, particular types of 'knowledge base' access, etc.) are touched during generation or when particular events in the generation process occur. However, although always necessary as a last resort and for novices, this mode of debugging has now in KPML given way completely to 'result focusing'. Here the generated result (which can be partial in cases where the resources fail to produce a final generated string) serves as a point of entry to all decisions taken during generation. This can also be mediated by the syntactic structure generated.</Paragraph> <Paragraph position="10"> This is an effective means of locating resource problems since, with the very conservative 'formalism' supported (see above), there are only two possible sources of generation errors: first, when the linguistic resources defined cover the desired generation result but an incorrect grammatical feature is selected (due to incorrect semantic mappings, or to wrongly constrained grammatical selections elsewhere); and second, when the linguistic resources do not cover the desired result. 
This means that the debugging task always consists of locating where in the feature selections made during generation--i.e., in the selection expressions for the relevant grammatical units--an inappropriate selection occurred.</Paragraph> <Paragraph position="11"> The selection expression list is accessed from the user interface by clicking on any constituent, either from the generated string directly or from a graphical representation of the syntactic structure. The list itself can be viewed in three ways: (i) as a simple list functioning as a menu, (ii) as a graphical representation of the type lattice (always a selected subregion of the lattice as a whole) with the selected features highlighted, and (iii) as an animated graphical trace of the 'traversal' of the type lattice during generation. In addition, all the structural details of a generated string are controlled by syntactic constraints that have single determinate positions in the type lattice. It is therefore also possible to interrogate the generated string directly to ask where particular structural features of the string were introduced. This is a more focused way of selecting particular points in the type lattice as a whole for inspection.</Paragraph> <Paragraph position="12"> Figure 1 shows a screenshot during this latter kind of user activity. The user is attempting to find out where the lexical constraints responsible for the selection of the noun &quot;TIME&quot; in the phrase &quot;At the same TIME&quot; were activated. Selecting to see the lexical class constraints imposed on this constituent (THING#3 in the structure, top-right) gives a listing of applied constraints (lower-right). This indicates which lexical constraints were applicable (e.g., NOUN, COMMON-NOUN, etc.) and where in the type lattice these constraints were introduced (e.g., at the disjunction named HEAD-SUBSTITUTION, etc.). Clicking on the disjunction name brings up a graphical view of the disjunction with the associated structural constraints (upper-left). The feature selected from a disjunction is highlighted in a different color (or shade of grey: lexical-thing). The 'paradigmatic context' of the disjunction (i.e., where in the type lattice it is situated) is given to the left of the disjunction proper: this is a boolean expression over types presented in standard systemic notation.</Paragraph> <Paragraph position="13"> Several directions are then open to the user. The user can either follow the decisions made in the type lattice to the left (less specific) or to the right (more specific), navigating in either case a selected sub-graph of the type lattice. Alternatively, the user can inspect the semantic decisions that were responsible for the particular selection of a grammatical feature in a disjunction. This 'upward' move is also supported graphically. The particular decisions made and their paths through the semantic choice experts ('choosers') associated with each (grammatical) disjunction are shown highlighted. Since all objects presented to the user are mouse-sensitive, navigation and inspection proceed by direct manipulation. All objects presented can be edited (either in situ or within automatically linked Emacs buffers). Any such changes are accumulated to define a patch version of the loaded resources; the user can subsequently create a distinct patch for the resources, or elect to accept the patches in the resource set. 
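The 'interrogate the generated string' facility described above amounts, in essence, to keeping a reverse index from structural constraints to the disjunctions that introduced them. A minimal sketch of that idea, reusing the illustrative Disjunction objects from the earlier example (again an assumption for exposition, not KPML's own code):

# Sketch of the index behind 'result focusing': because every structural
# constraint has a single determinate position in the lattice, mapping each
# constraint back to its disjunction answers "where was this feature of the
# generated string introduced?".  Illustrative only.
from typing import Dict

def build_reverse_index(lattice) -> Dict[str, str]:
    """Map each structural constraint to the disjunction (and chosen feature)
    that introduces it, e.g. index['insert Subject'] == 'MOOD:indicative'."""
    index: Dict[str, str] = {}
    for system in lattice:                 # a list of Disjunction objects as above
        for feature, realizations in system.realizations.items():
            for constraint in realizations:
                index[constraint] = f"{system.name}:{feature}"
    return index

def where_introduced(constraint: str, index: Dict[str, str]) -> str:
    """Point the grammar writer at the lattice position responsible for a constraint."""
    return index.get(constraint, "not introduced by the loaded resources")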
Generation itself is fast (due to a simple algorithm: see above), and so creating a new 'result string' for further debugging in the face of changes made is the quickest and most convenient way to conduct further tests. This eliminates any need for backtracking at the user development level. It is possible to examine contrastively the use of resources across distinct generation cycles.</Paragraph> <Paragraph position="14"> One useful way of viewing this kind of activity is by contrast with the state of affairs when debugging programs. KPML maintains the linguistic structure as an explicit record of the process of generation. All of the decisions that were made during generation are accessible via the traces they left in the generated structure. Such information is typically not available when debugging a computer program, since once the execution stack has been unwound intermediate results have been lost. If certain intermediate results must consequently be re-examined, it is necessary to introduce tracing at appropriate points--a procedure that can now usually be avoided, resulting in significantly faster cycles of debugging and testing.</Paragraph> <Paragraph position="15"> Multilingual representations

The use of multilingual system networks has been motivated by, for example, Bateman, Matthiessen, Nanri and Zeng (Bateman et al., 1991). KPML provides support for such resources, including contrastive graphical displays of the type lattices for distinct languages. In addition, it is possible to merge monolingual or multilingual resource definitions automatically and to separate them out again as required. Importing segments of a type lattice for one language to form a segment for a distinct language is also supported. This has shown that it is not necessary to maintain a simple division between, for example, 'core' grammar and variations. Indeed, such a division is wasteful since language pairs differ in the areas they share. The support for this multilinguality is organized entirely around the paradigmatic type lattice. The support tools provided for manipulating such language-conditionalized lattices in KPML appear to significantly reduce the development time for generation resources for new languages. A black-and-white representation of a contrastive view based on the EAGLES morphology recommendations is shown in Figure 2. The graph emphasizes areas held in common and explicitly labels parts of the lattice that are restricted in their language applicability.</Paragraph> <Paragraph position="16"> The possibilities supported for working multilingually (e.g., inheritance, merging resources) rely entirely on the relative multilingual applicability of the paradigmatic organization of the grammar. It appears to be a fact of multilingual description that paradigmatic functional organizations are more likely to show substantial similarities across languages than are syntagmatic structural descriptions. In an overview of resource definitions across six languages, it was found that the multilingual description contains only 32% of the number of objects that would be needed if the six grammars were represented separately. 
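One minimal way to picture such language-conditionalized lattices is to annotate every disjunction with the set of languages for which it holds, so that merging resource definitions and separating out a monolingual view become simple set operations. The representation below is an assumption made for the example and is not KPML's resource format:

# Illustrative sketch of a language-conditionalized lattice: disjunction name
# mapped to the set of language tags for which the disjunction is valid.
from typing import Dict, Set

MultilingualLattice = Dict[str, Set[str]]

def merge(a: MultilingualLattice, b: MultilingualLattice) -> MultilingualLattice:
    """Fold two (mono- or multilingual) resource definitions into one,
    unioning the language conditions of any disjunctions they share."""
    merged: MultilingualLattice = {name: set(langs) for name, langs in a.items()}
    for name, langs in b.items():
        merged.setdefault(name, set()).update(langs)
    return merged

def monolingual_view(lattice: MultilingualLattice, language: str) -> Set[str]:
    """Separate out the disjunctions applicable to a single language."""
    return {name for name, langs in lattice.items() if language in langs}

# Toy illustration of the sharing effect reported above: the merged description
# is smaller than the sum of its monolingual parts because shared disjunctions
# are represented only once.
english = {"MOOD": {"en"}, "DETERMINATION": {"en"}}
german = {"MOOD": {"de"}, "CASE": {"de"}}
assert len(merge(english, german)) == 3  # not 4: MOOD is shared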
Significant degrees of overlap have also been reported whenever a description of one language has been attempted on the basis of another (cf., e.g., (Alshawi et al., 1992; Rayner et al., 1996)).</Paragraph> <Paragraph position="17"> The paradigmatic basis simply extends the range of similarities that can be represented and provides the formal basis for computational tools that support the user when constructing language descriptions 'contrastively'.</Paragraph> <Paragraph position="18"> Distributed large-scale grammar development

The paradigmatic organization of a large-scale grammar shows a further formal property that is utilized throughout the KPML GDE. Early work on systemic descriptions of English noted the emergence of 'functional regions': i.e., areas of the grammar overall that are concerned with particular areas of meaning. As Halliday notes: &quot;These [functional] components are reflected in the lexicogrammatical system in the form of discrete networks of options. Each ... is characterized by strong internal but weak external constraints: for example, any choice made in transitivity [clause complementation] has a significant effect on other choices within the transitivity systems, but has very little effect on choices within the mood [speech act types] or theme [information structuring] systems.&quot; (Halliday, 1978, p. 113).</Paragraph> <Paragraph position="19"> This organization was first used computationally in the development of the Nigel grammar of English within the Penman project. Nigel was probably the first large-scale computational grammar whose precise authorship is difficult to ascertain because of the number of different linguists who have contributed to it at different times and locations.</Paragraph> <Paragraph position="20"> The basis for this successful example of distributed grammar development is the organization of the overall type lattice of the grammar into modular functional regions. This has now been taken as a strong design principle within KPML, where all user access to the large type lattices making up a grammar is made through specific functional regions: for example, asking to graph the lattice will by default only present information within a single region (with special pointers out of the region to indicate broader connectivity). This is the paradigmatic equivalent of maintaining a structural grammar in modules related by particular syntactic forms. However, whereas the latter information is not strongly organizing for work on a generation grammar, the former is: work on a generation resource typically proceeds by expanding a selected area of expressive potential--i.e., the ability of the grammar to express some particular set of semantic requirements. This can include a range of grammatical forms and is best modularized along the paradigmatic dimension rather than the syntagmatic.</Paragraph> <Paragraph position="21"> The relative strength of intra-region connections in contrast to extra-region connections has provided a solid basis for distributed grammar development.</Paragraph> <Paragraph position="22"> Developers typically announce that they are interested in the expressive potential of some functional region. This both calls for others interested in the same functional region to exchange results cooperatively and warns generally that a functional region may be subject to imminent change. When a revised version of the region is available, it replaces the previous version used. 
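Replacing a region in this way lends itself to the simple regression check detailed in the following paragraph: re-run the existing test suite and require the same generated strings as before. A minimal sketch, in which the `generate` callback and the test-suite format are hypothetical stand-ins for the generator interface:

# Hypothetical sketch of an acceptance check for a revised functional region:
# re-generate every example in the existing test suite and report any case
# whose output differs from the previously accepted string.
from typing import Callable, List, Tuple

# A test suite: (semantic specification, expected surface string) pairs.
TestSuite = List[Tuple[dict, str]]

def region_regression_check(generate: Callable[[dict], str],
                            suite: TestSuite) -> List[Tuple[dict, str, str]]:
    """Return the cases whose output changed after the region was replaced;
    an empty list means the revised region preserved previous behaviour."""
    failures = []
    for semantics, expected in suite:
        produced = generate(semantics)
        if produced != expected:
            failures.append((semantics, expected, produced))
    return failures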
Integration of the new region is facilitated by a range of visualization tools and connectivity checks: the final test of acceptability is that the existing test suites produce the same results as with the previous region version and that a new test suite is provided that demonstrates the increased or revised functionality of the new region.</Paragraph> <Paragraph position="23"> Regions are defined across languages: the current multilingual resources released with KPML include around 60 regions. A partial region connectivity graph for the English grammar is shown in Figure 3.</Paragraph> <Paragraph position="24"> This graph also serves as a 'menu' for accessing further graphical views of the type lattice as well as selections from test suites illustrating use of the resources contained within a region. Dependencies between regions are thus clearly indicated.</Paragraph> <Paragraph position="25"> Integrated test suites

Sets of linguistic resources for generation are typically provided with test suites: such test suites consist minimally of a semantic specification and the string that should result when generating. In KPML these are indexed according to the grammatical features that are selected during their generation. This permits examples of the use and consequences of any feature from the type lattice to be presented during debugging. This is one particularly effective way not only of checking the status of the resources but also of documenting them. The complete generation history of examples can be examined in exactly the same way as newly generated strings. An interesting line of development underway is to investigate correspondences between the paradigmatic features describing examples in a KPML example set and those features used in the TSNLP initiative.</Paragraph> </Section> </Paper>