<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1627"> <Title>Spatial descriptions as referring expressions in the MapTask domain</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We discuss work-in-progress on a hybrid approach to the generation of spatial descriptions, using the maps of the Map Task dialogue corpus as domain models. We treat spatial descriptions as referring expressions that distinguish particular points on the maps from all other points (potential 'distractors'). Our approach is based on rule-based overgeneration of spatial descriptions combined with ranking which currently is based on explicit goodness criteria but will ultimately be corpus-based.</Paragraph> <Paragraph position="1"> Ranking for content determination tasks such as referring expression generation raises a number of deep and vexing questions about the role of corpora in NLG, the kind of knowledge they can provide and how it is used.</Paragraph> <Paragraph position="2"> 1 Introduction, or: The lack of domain model annotation in corpora used for ranking in NLG In recent years, ranking approaches to Natural Language Generation (NLG) have become increasingly popular. They abandon the idea of generation as a deterministic decision-making process in favour of approaches that combine overgeneration with ranking at some stage in processing. A major motivation is the potential reduction of manual development costs and increased adaptability and robustness.</Paragraph> <Paragraph position="3"> Several approaches to sentence realization use ranking models trained on corpora of human-authored texts to judge the fluency of the candidates produced by the generation system. The work of [Langkilde and Knight, 1998; Langkilde, 2002] describes a sentence realizer that uses word ngram models trained on a corpus of 250 million words to rank candidates. [Varges and Mellish, 2001] present an approach to sentence realization that employs an instance-based ranker trained on a semantically annotated subset of the Penn treebank II ('Who's News' texts). [Ratnaparkhi, 2000] describes a sentence realizer that had been trained on a domain-specific corpus (in the air travel domain) augmented with semantic attribute-value pairs. [Bangalore and Rambow, 2000] describe a realizer that uses a word ngram model combined with a tree-based stochastic model trained on a version of the Penn treebank annotated in XTAG grammar format.</Paragraph> <Paragraph position="4"> [Karamanis et al., 2004] discuss centering-based metrics of coherence that could be used for choosing among competing text structures. The metrics are derived from the Gnome corpus [Poesio et al., 2004].</Paragraph> <Paragraph position="5"> In sum, these approaches use corpora with various types of annotation: syntactic trees, semantic roles, text structure, or no annotation at all (for word-based ngram models). However, what they all have in common, even when dealing with higher-level text structures, is the absence of any domain model annotation, i.e. information about the available knowledge pool from which the content was chosen. This seems to be unproblematic for surface realization where the semantic input has been determined beforehand.</Paragraph> <Paragraph position="6"> This paper asks what the lack of domain information means for ranking in the context of content determination, focusing on the generation of referring expressions (GRE). 
<Paragraph position="7"> A particularly intriguing aspect of GRE is the role of distractors in choosing the content (types, attributes, relations) used to describe the target object(s). For example, we may describe a target as 'the red car' if there is also a blue one, but simply as 'the car' if there are no other cars in the domain (though possibly objects of other types); see the distractor-elimination sketch at the end of this section. [Stone, 2003] proposes to use this observation to reason backwards from a given referring expression to the state of the knowledge base that motivated it. We may call this the 'presuppositional' or 'abductive' view of GRE. The approach is intended to address the knowledge acquisition bottleneck in NLG by means of example specifications constructed for the purpose of knowledge acquisition. It seems to us that, if the approach were applied to actual text corpora, one would need to address the fact that people often include 'redundant' attributes that do not eliminate any distractors. Thus, 'the red car' does not necessarily presuppose the existence of another car of a different colour. Furthermore, there are likely to be a large number of domain models/knowledge bases that could have motivated the production of a given referring expression. </Paragraph>
<Paragraph position="8"> [Siddharthan and Copestake, 2004] take a corpus-based perspective and essentially regard a text as a knowledge base from which descriptions of domain objects can be extracted. Some NPs are taken to describe the same object (for example, if they have the same head noun and share attributes and relations in certain ways), while others are deemed distractors; the second sketch below illustrates this grouping. It seems that, in contrast to [Stone, 2003], this approach cannot recover domain objects or properties that are never mentioned, because it only extracts what is explicitly stated in the text. </Paragraph>
<Paragraph position="9"> Both the work reported in [Stone, 2003] and that in [Siddharthan and Copestake, 2004] can be seen as attempts to deal with the lack of domain model information in situations where only the surface forms of referring expressions are given. Obtaining such a domain model is highly desirable in order to establish which part of a larger knowledge pool is actually selected for realization; this could be used, for example, to automatically learn models of content selection. However, as observed above, most corpora do not provide this kind of knowledge, for obvious practical reasons: how can we know what knowledge a Wall Street Journal author had available at the time of writing? In this paper, we describe work in progress on exploiting a corpus that provides not only surface forms but also domain model information: the MapTask dialogue corpus [Anderson et al., 1991]. </Paragraph>
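<Paragraph position="10"> The role of distractors described above can be made concrete with a small attribute-selection loop in the spirit of classic incremental GRE algorithms (e.g. Dale and Reiter's). This is a hypothetical illustration, not the method of [Stone, 2003] or of this paper; the toy domain and the attribute preference order are assumptions.
    # Minimal sketch of distractor elimination for referring expressions.
    DOMAIN = [
        {"id": 1, "type": "car", "colour": "red"},
        {"id": 2, "type": "car", "colour": "blue"},
        {"id": 3, "type": "tree", "colour": "green"},
    ]

    def describe(target, preference=("type", "colour")):
        """Add attributes of the target until no distractors remain."""
        distractors = [o for o in DOMAIN if o["id"] != target["id"]]
        description = {}
        for attr in preference:
            value = target[attr]
            ruled_out = [o for o in distractors if o[attr] != value]
            if ruled_out:  # the attribute has discriminatory power
                description[attr] = value
                distractors = [o for o in distractors if o[attr] == value]
            if not distractors:
                break
        return description

    # With a blue car present, colour is needed to single out car 1:
    print(describe(DOMAIN[0]))  # {'type': 'car', 'colour': 'red'}
If the blue car is removed from the domain, the same call returns only {'type': 'car'}, i.e. 'the car', mirroring the example above. Note that an algorithm of this kind never produces the 'redundant' attributes that human speakers often include. </Paragraph>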
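<Paragraph position="11"> For the corpus-based perspective of [Siddharthan and Copestake, 2004], a correspondingly minimal sketch groups the noun phrases of a text by head noun and treats NPs that share a head as potential distractors of one another. Again, the pre-parsed NPs and the grouping criterion are simplifying assumptions for illustration, not the authors' actual extraction procedure; it only recovers objects and properties that the text mentions explicitly.
    from collections import defaultdict

    # Hypothetical pre-parsed NPs from a text: (modifiers, head noun).
    NPS = [
        (("red",), "car"),
        (("blue",), "car"),
        ((), "tree"),
    ]

    # Treat the text as a knowledge base: group NPs by head noun.
    by_head = defaultdict(list)
    for modifiers, head in NPS:
        by_head[head].append(modifiers)

    # NPs sharing a head but differing in modifiers are treated as
    # potential distractors of one another.
    for head, variants in by_head.items():
        if len(variants) > 1:
            print("potential distractor set for '%s': %s" % (head, variants))
</Paragraph>
</Section>
</Paper>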