Towards Developing Generation Algorithms for Text-to-Text Applications

1 Introduction

Many of today's most popular natural language applications - Machine Translation, Summarization, Question Answering - are text-to-text applications. That is, they produce textual outputs from inputs that are also textual. Because these applications need to produce well-formed text, one would expect them to be a natural testbed for the generic generation components developed within the Natural Language Generation (NLG) community. Over the years, several generic NLG systems have been proposed: Penman (Matthiessen and Bateman, 1991), FUF (Elhadad, 1991), Nitrogen (Knight and Hatzivassiloglou, 1995), Fergus (Bangalore and Rambow, 2000), HALogen (Langkilde-Geary, 2002), Amalgam (Corston-Oliver et al., 2002), etc.

Instead of relying on such generic NLG systems, however, most current text-to-text applications address their generation needs by other means. In Machine Translation, for example, sentences are produced using application-specific "decoders", inspired by work on speech recognition (Brown et al., 1993), whereas in Summarization, summaries are produced either as extracts or using task-specific strategies (Barzilay, 2003). The main reason why text-to-text applications do not usually employ generic NLG systems is that such applications do not have access to the kind of information that the input representation formalisms of current NLG systems require. A machine translation or summarization system does not usually have access to deep subject-verb or verb-object relations (such as ACTOR, AGENT, PATIENT, POSSESSOR, etc.) as needed by Penman or FUF, or even to shallower syntactic relations (such as subject, object, premod, etc.) as needed by HALogen.

In this paper, following the recent proposal of Nederhof and Satta (2004), we argue for the use of IDL-expressions as an application-independent, information-slim representation language for text-to-text natural language generation. IDL-expressions are created from strings using four operators: concatenation (·), interleave (‖), disjunction (∨), and lock (×). We claim that the IDL formalism is appropriate for text-to-text generation because it encodes meaning only via words and phrases, combined using a set of formally defined operators. Appropriate words and phrases can be, and usually are, produced by the applications mentioned above. The IDL operators have been specifically designed to handle natural constraints such as word choice and precedence, constructions such as phrasal combination, and underspecifications such as free word order.
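To make the operators concrete, the following sketch enumerates the realizations of a small IDL-expression under a naive, enumerate-everything semantics. The tuple encoding, the function names, and the treatment of lock as token-joining are our own illustrative choices; the formalism itself is defined by Nederhof and Satta (2004) over finite-state devices, not by this enumeration.

    def interleavings(a, b):
        """All merges of tuples a and b that preserve each tuple's internal order."""
        if not a:
            return [b]
        if not b:
            return [a]
        return ([(a[0],) + rest for rest in interleavings(a[1:], b)] +
                [(b[0],) + rest for rest in interleavings(a, b[1:])])

    def realize(e):
        """Enumerate all realizations (tuples of tokens) of a toy IDL-expression.

        Encoding (illustrative, not the formalism's own):
          ('w', word)      single word
          ('cat', e1, e2)  concatenation
          ('or',  e1, e2)  disjunction
          ('par', e1, e2)  interleave
          ('lock', e1)     lock: realizations become atomic units
        """
        op = e[0]
        if op == 'w':
            return [(e[1],)]
        if op == 'cat':
            return [x + y for x in realize(e[1]) for y in realize(e[2])]
        if op == 'or':
            return realize(e[1]) + realize(e[2])
        if op == 'par':
            return [m for x in realize(e[1]) for y in realize(e[2])
                    for m in interleavings(x, y)]
        if op == 'lock':
            # Collapse each realization into a single token so that an
            # enclosing interleave cannot insert material inside it.
            return [(' '.join(x),) for x in realize(e[1])]

    # "finally" . (lock("the prisoners") || (lock("were released") v "walked"))
    expr = ('cat', ('w', 'finally'),
            ('par', ('lock', ('cat', ('w', 'the'), ('w', 'prisoners'))),
                    ('or', ('lock', ('cat', ('w', 'were'), ('w', 'released'))),
                           ('w', 'walked'))))
    for r in realize(expr):
        print(' '.join(r))

The lock around "the prisoners" keeps that phrase contiguous under the enclosing interleave, so the example prints exactly four realizations, among them "finally the prisoners were released" and "finally were released the prisoners".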
In Table 1, we present a summary of the representation and generation characteristics of current NLG systems. We mark with ✓ characteristics that are needed or desirable in a generation component for text-to-text applications, and with ✗ characteristics that make a proposal inapplicable or problematic. For instance, as already argued, the representation formalism of all previous proposals except IDL is problematic (✗) for text-to-text applications. The IDL formalism, while applicable to text-to-text applications, has the additional desirable property of being a compact representation, whereas formalisms such as word-lattices and non-recursive CFGs can have exponential size in the number of words available for generation (Nederhof and Satta, 2004).

While the representational properties of IDL are all desirable, the generation mechanism proposed for IDL by Nederhof and Satta (2004) is problematic (✗), because it does not allow for scoring and ranking of candidate realizations. Their generation mechanism, while computationally efficient, involves intersection with context-free grammars: it works by excluding all realizations that are not accepted by a CFG and including, without ranking, all realizations that are accepted.

The approach to generation taken in this paper is presented in the last row of Table 1, and can be summarized as a ✓-tiling of the generation characteristics of previous proposals (see the shaded area in Table 1). Our goal is to provide an optimal generation framework for text-to-text applications, in which the representation formalism, the generation mechanism, and the computational properties are all needed and desirable (✓). Toward this goal, we present a new generation mechanism that intersects IDL-expressions with probabilistic language models. The generation mechanism implements new algorithms that cover a wide spectrum of run-time behaviors (from linear to exponential), depending on the complexity of the input. We also present theoretical results concerning the correctness and the efficiency (relative to the complexity of the input IDL-expression) of our algorithms.

We evaluate these algorithms on a challenging word-ordering task. The experiments are carried out under a high-complexity generation scenario: find the most probable sentence realization under an n-gram language model for IDL-expressions encoding bags-of-words of size up to 25 (up to 10^25 possible realizations!). Our evaluation shows that the proposed algorithms cope well with such orders of complexity while maintaining high levels of accuracy.
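The algorithms presented in this paper avoid exhaustive enumeration; purely to pin down the objective they optimize, the brute-force sketch below scores every ordering of a tiny bag of words under a toy bigram model and returns the argmax. The probabilities and the unseen-bigram floor are invented for illustration; a real system would use a smoothed n-gram model trained on a corpus.

    import itertools

    # Toy bigram log-probabilities; values are invented for illustration.
    BIGRAM_LOGP = {
        ('<s>', 'the'): -0.5, ('the', 'prisoners'): -1.0,
        ('prisoners', 'were'): -1.2, ('were', 'released'): -0.8,
        ('released', '</s>'): -0.3,
    }
    UNSEEN = -10.0  # crude floor standing in for real smoothing

    def bigram_score(tokens):
        """Log-probability of a token sequence under the toy bigram model."""
        seq = ('<s>',) + tuple(tokens) + ('</s>',)
        return sum(BIGRAM_LOGP.get(pair, UNSEEN) for pair in zip(seq, seq[1:]))

    def best_ordering(bag):
        """Argmax over all orderings of a bag of words.

        This enumeration is O(n!) and only workable for tiny bags; it is
        exactly the blow-up that intersecting IDL-expressions with the
        language model is designed to avoid.
        """
        return max(itertools.permutations(bag), key=bigram_score)

    print(' '.join(best_ordering(['prisoners', 'released', 'the', 'were'])))

Even at a bag size of 10 this enumeration already means millions of candidates, and at size 25 it is hopeless, which is why algorithms that find the argmax without enumerating realizations are needed.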