Discrete Optimization as an Alternative to Sequential Processing in NLG
Tomasz Marciniak and Michael Strube
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg, Germany
http://www.eml-research.de/nlp
Abstract
We present an NLG system that uses Integer Lin-
ear Programming to integrate different decisions
involved in the generation process. Our approach
provides an alternative to pipeline-based sequential
processing which has become prevalent in today’s
NLG applications.
1 Introduction
From an engineering perspective, one of the major consid-
erations in building a Natural Language Generation (NLG)
system is the choice of the architecture. Two important issues
that need to be considered at this stage are firstly, the modu-
larization of the linguistic decisions involved in the genera-
tion process and secondly, the processing flow (cf. [De Smedt
et al., 1996]).
On one side of the spectrum lie integrated systems, with
all linguistic decisions being handled within a single process
(e.g. [Appelt, 1985]). Such architectures are theoretically at-
tractive, as they assume a close coordination of different types
of linguistic decisions, which are known to be dependent on
one another (cf. e.g. [Danlos, 1984]). A major disadvantage
of integrated models is the complexity that they necessarily
involve, which results in poor portability and scalability. On
the other side of the spectrum there are highly modularized
pipeline architectures. A prominent example of this second
case is the consensus pipeline architecture recognized by [Re-
iter, 1994] and further elaborated in [Reiter and Dale, 2000].
The modularization of Reiter’s model occurs at two levels.
First, individual linguistic decisions of the same type (e.g.
involving lexical or syntactic choice) are grouped together
within single low level tasks, such as lexicalization, aggre-
gation or ordering. Second, tasks are allocated to three high-
level generation stages, i.e. Document Planning, Microplan-
ning and Surface Realization. The processing flow in the
pipeline architecture is sequential, with individual tasks be-
ing executed in a predetermined order.
A study of applied NLG systems [Cahill and Reape, 1999]
reveals, however, that while most applied NLG systems rely
on sequential processing, they do not follow the strict modu-
larization that the consensus model assumes. Low-level tasks
are spread over various generation stages and may in fact be
executed more than once at diverse positions in the pipeline.
An attempt to account for commonalities that many NLG
systems share, without imposing too many restrictions, as is
the case with Reiter’s “consensus” model, is the Reference
Architecture for Generation Systems (RAGS) [Mellish et al.,
2004]. RAGS is an abstract specification of an NLG architec-
ture that focuses on two issues: data types that the generation
process manipulates and a generic model of the interactions
between modules, based on a common central server. An
important feature of RAGS is that it leaves the question of
processing flow to the actual implementation. Hence it is the-
oretically possible to build both fully integrated as well as
pipeline-based systems that would observe the RAGS princi-
ples. Two implementations of RAGS presented in [Mellish
and Evans, 2004] demonstrate an intermediate way.
In this paper we present a novel approach to building an
integrated NLG system, in which the generation process is
modeled as a discrete optimization problem. It provides an
extension to the classification-based generation framework,
presented in [Marciniak and Strube, 2004]. We first assume
modularization of the generation process at the lowest possi-
ble level: individual tasks correspond to realizations of sin-
gle form elements (FEs) that build up a linguistic expression.
The decisions that these tasks involve are then represented
as classification tasks and integrated via an Integer Linear
Programming (ILP) formulation (see e.g. [Nemhauser and
Wolsey, 1999]). This way we avoid the well-known ordering
problem that is present in all pipeline-based systems. Observ-
ing, at least partially, the methodological principles of RAGS,
we specify the architecture of our system at two independent
levels. At the abstract level, the low-level generation tasks
are defined, all based on the same input/output interface. At
the implementation level, the processing flow and integration
method are determined.
The rest of the paper is organized as follows: in Section 2
we present briefly the classification-based generation frame-
work and remark on the shortcomings of pipeline-based pro-
cessing. In Section 3 we introduce the ILP formulation of the
generation task, and in Section 4 we report on the experiments
and evaluation of the system.
2 Classification-Based Generation
In informal terms, classification can be characterized as the
task of assigning a class label to an unknown instance, given a
set of its properties and attributes represented as a feature vector.
In recent years supervised machine learning methods relying
on pre-classified training data have been applied in various
areas of NLP to solve tasks formulated as classification
problems. In NLG machine learning methods have been used
to solve single tasks such as content selection and ordering
(e.g. [Duboue, 2004; Dimitromanolaki and Androutsopoulos,
2003]), lexicalization (e.g. [Reiter and Sripada, 2004]) and
referring expressions generation (e.g. [Cheng et al., 2001]).
[Figure 1: LTAG-based derivation at the clause (left) and discourse levels (right). Elementary trees are represented as feature
vectors. Adjunction operations are marked with dashed arrows.]
In these applications classifiers trained on labeled data
have proven more robust and efficient than approaches using
explicit expert knowledge. The difficulty of formalizing
the linguistic knowledge involved in the development
of a knowledge-based system (the so-called knowledge acquisition
bottleneck) has been replaced with the effort of obtaining the
right kind of data, which typically involves manually annotating
a corpus of relevant texts with the required linguistic
information (cf. [Daelemans, 1993]).
The classification-based generation framework that we
introduced in [Marciniak and Strube, 2004] is based on the simple
idea that the linguistic form of an expression can be decomposed
into a set of discrete form elements (FEs) representing
both its syntactic and lexical properties. The generation process
is then modeled as a series of classification tasks that
realize individual FEs. The realization of each FE is regarded
as a single low-level generation task.
2.1 Route Directions
As the main application for this work we consider the task of
generating natural language route directions. An example of
such a text is given below:
(a) Facing the Wildcat statue, (b) turn left on the
brick sidewalk (c) and continue along the road to
the Sports Complex. (d) Make a right onto Concord
Road, (e) and keep going straight, (f) passing Pres-
byterian Church on your left, (g) until you reach
Copeland Street. (h) The library building will be
just around the corner on your right.
We analyze the content of instructional texts of this kind in
terms of temporally related situations, i.e. actions (b, c, d, e),
states (a, h) and events (f, g), denoted by individual discourse
units. The temporal structure of the texts is then modeled as a
tree, with nodes representing individual situation descriptions
and edges signaling the relations (see Figure 2). The seman-
tics of each discourse unit is further represented as a feature
vector describing the aspectual category and frame structure
of the profiled situation. This tree-based representation of the
semantic content of route directions constitutes the input to
the generation process. A detailed description of the underly-
ing conceptual model and the annotation process is presented
in [Marciniak and Strube, 2005].
[Figure 2: Temporal Structure. A tree whose nodes are the discourse units (a) to (h) of the example text and whose edges are labelled initial, subsequent or ongoing.]
2.2 From LTAG to Form Elements
To specify an inventory of FEs that would become objects of
the low-level generation tasks, we first apply the Lexicalized
Tree Adjoining Grammar (LTAG) formalism (see e.g. [Joshi
and Schabes, 1991]) to model the linguistic form of the texts.
In LTAG, the derivation of a linguistic structure starts with a
selection of elementary trees, anchored by lexical items, such
as verbs or prepositions at the clause level and discourse con-
nectives at the discourse level (cf. [Webber and Joshi, 1998]).
In the next step, elementary trees are put together by means
of adjunction operations that follow the dependency structure
provided by the derivation tree. We take the temporal struc-
ture from Figure 2 to constitute the discourse level derivation
tree, with the temporal relationships corresponding to the syn-
tactic dependencies. At the clause level, the derivation tree is
isomorphic with the frame-based ontological representation
of individual situations (see [Marciniak and Strube, 2005]).
The clause- and discourse-level derivation of discourse unit
(c) from the above example in the context of (a) and (b) is
depicted in Figure 1. At the clause level, the set of elementary
trees includes one initial tree α1 anchored by the main
verb, which also specifies the syntactic frame of the clause,
and auxiliary trees β1 and β2 corresponding to the verb arguments.
At the discourse level, the discourse unit which occupies
the root position in the temporal structure (cf. Figure 2)
is modeled as the initial tree α2, and auxiliary trees β3 and β4
represent the remaining discourse units.
Adj. Rank    Adj. Dir.     Conn.         S Exp.        Verb Lex.     Verb Form
1            right         and           VP            continue      Bare Inf.
Phr. Type1   Prep. Lex.1   Adj. Rank1    Phr. Type2    Prep. Lex.2   Adj. Rank2
PP           along         1             PP            to            2
Table 1: FE-based form representation of "and continue along
the road to the Sports Complex".
To model the whole process in a uniform way we encode
the elementary trees as feature vectors, with individual
features conveying syntactic (e.g. s_exp) and lexical (e.g.
verb_lex) information. Features adj_rank and adj_dir denote
respectively the ordering of the adjunction operations and the
adjunction direction, which both determine the linear structure
of the text. Hence the form of the whole discourse can
be represented in terms of feature-value pairs used to encode
the initial trees and the derivation process. On that basis we
define a set of form elements building up a discourse as directly
corresponding to the individual features. A detailed
description of the FEs is given below:
FE1: Adjunction Rank / Disc. Level specifies the linear
rank of each discourse unit at the local level, i.e. only clauses
temporally related to the same parent clause are considered.
FE2: Adjunction Direction is concerned with the position
of the child discourse unit relative to the parent one (e.g. (a)
left of (b), (c) right of (b), etc.).
FE3: Connective determines the lexical form of the dis-
course connective (e.g. null in (a), until in (g)).
FE4: S Expansion specifies whether a given discourse unit
is realized as a clause with the explicit subject (i.e. np+vp
expansion of the root S node in a clause) (e.g. (g, h)) or not
(e.g. (a), (b)).
FE5: Verb Form denotes the form of the main verb in a
clause (e.g. gerund in (a), (f), bare infinitive in (b), finite
present in (g), etc.).
FE6: Verb Lex. specifies the lexical form of the main verb
(e.g. turn in (b), pass in (f) or reach in (g)).
FE7: Phrase Type determines for each argument in a clause
its syntactic realization as a noun phrase (NP), prepositional
phrase (PP) or a particle (P).
FE8: Preposition Lex. is concerned with the choice of a
lexical form for prepositions or particles in argument phrases
(e.g. left and on in (b) or along and to in (c)). If the value of
FE7 is NP, then this FE is set to none.
FE9: Adjunction Rank / Phr. Level specifies the linear
rank of each verb argument within a clause.
As an example, consider the FEs-based representation of
the form of clause (c) presented in Table 1. Realization of
each FEi is represented as a classification task Ti, with a set
of possible class labels corresponding to the different forms
that FEi may take. Only tasks T1 and T9 associated re-
spectively with Adjunction Rank / Disc. Level and Adjunction
Rank / Phr. Level are split into a series of binary precedence
classifications that determine the relative position of two dis-
course units or phrasal arguments at a time (e.g. (a) ≺ (c), (c)
≺ (d), and similarly along the road ≺ to the sports complex
etc.). These partial results are later combined to determine
the rank of the respective constituents.
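The combination of pairwise precedence decisions into ranks can be sketched as follows. The text does not say how the partial results are combined, so the counting rule used here (rank = number of predicted predecessors + 1) is an assumption, and `rank_from_precedence` and the `precedes` callable are hypothetical names standing in for the binary precedence classifiers.

```python
from itertools import combinations

def rank_from_precedence(items, precedes):
    """Turn pairwise precedence decisions into 1-based ranks.

    items:    list of constituents (phrases or discourse units)
    precedes: callable (a, b) -> True if a is predicted to precede b
              (hypothetical stand-in for a binary precedence classifier)
    """
    predecessors = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        if precedes(a, b):
            predecessors[b] += 1   # a comes before b
        else:
            predecessors[a] += 1   # b comes before a
    # Assumed rule: rank = number of predecessors + 1.
    # Inconsistent pairwise decisions can produce ties.
    return {item: count + 1 for item, count in predecessors.items()}

# Toy oracle reproducing the phrase order of clause (c).
order = ["along the road", "to the sports complex"]
ranks = rank_from_precedence(order, lambda a, b: order.index(a) < order.index(b))
```

With a consistent set of pairwise decisions this yields a strict ordering; a real system would also need a policy for resolving ties.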
Arguably, the above FEs and the corresponding tasks are
independent of the underlying grammatical model. In this
work we use the abstraction of the grammatical structure pro-
vided by LTAG, but the same or a similar set of FEs can
be readily derived from other formalisms (cf. e.g. [Meteer,
1990]). The role of the grammatical theory in defining form
elements is twofold. First, it specifies the exact position of
individual FEs in the grammatical structure, making it clear
how they should be assembled. Second, it ensures a wide
coverage: although the linguistic structures that we consider
here are relatively simple, the use of LTAG as the underlying
grammatical formalism guarantees that our generation framework
can be applied to producing much more complex constructions,
both at the clause and discourse levels. Naturally,
this would require a richer feature vector representation
of the initial trees, and hence a larger number of FEs and the
corresponding generation tasks. The basic principles of the
generation process, however, would remain unchanged.
Notice also that the tasks considered here can be grouped
under the conventional NLG labels, such as text structuring
(i.e. T1, T2), lexicalization (i.e. T3, T6, T8) and sentence realization
(i.e. T4, T5, T9). Yet another important NLG task,
i.e. aggregation, appears to be handled indirectly by T3 (e.g.
Turn left. Continue along the road. vs. Turn left and continue
along the road.) and T5 (e.g. Keep going straight. You
will pass the Presbyterian Church on your right. vs. Keep going
straight, passing the Presbyterian Church on your right.).
We view it as a strength of our approach that, regardless of
their different linguistic character, all these tasks are modeled
in exactly the same way.
2.3 System Architecture and Sequential
Processing
At an abstract level, the architecture of our system consists
of an unordered set of classifiers solving individual genera-
tion tasks. Each classifier is trained on a separate set of data
obtained from the corpus of route directions annotated with
both semantic and grammatical information.
In the previous work [Marciniak and Strube, 2004] we fol-
lowed the sequential paradigm advocated by [Daelemans and
van den Bosch, 1998] and implemented the system as a cas-
cade of classifiers. In such systems the output representation
is built incrementally, with subsequent classifiers having ac-
cess to the outputs of previous modules. An important char-
acteristic of this model is its extensibility. Since classifiers
rely on a uniform representation of the input (i.e. a feature
vector) and the output (i.e. a single feature value), it is easy to
change the ordering or insert new modules at any place in the
pipeline. Both operations only require retraining classifiers
with a new selection of the input features.
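A cascade of this kind can be sketched in a few lines. This is only an illustration of the control flow, not the implementation from [Marciniak and Strube, 2004]: `run_cascade`, the `predict` callables and the feature name `relation` are hypothetical.

```python
def run_cascade(semantic_features, classifiers):
    """Run classifiers in a fixed order; each one sees the earlier outputs.

    semantic_features: dict of input features for one discourse unit
    classifiers: ordered list of (task_name, predict) pairs, where
                 predict(features: dict) -> label is a hypothetical
                 stand-in for a trained classifier
    """
    features = dict(semantic_features)   # copy so the input is not mutated
    outputs = {}
    for task_name, predict in classifiers:
        label = predict(features)
        outputs[task_name] = label
        features[task_name] = label      # later modules can condition on this
    return outputs

# Toy stand-ins: the verb-form rule depends on the connective chosen before it.
cascade = [
    ("connective", lambda f: "until" if f["relation"] == "ongoing" else "null"),
    ("verb_form", lambda f: "fin_pres" if f["connective"] == "until" else "bare_inf"),
]
result = run_cascade({"relation": "ongoing"}, cascade)
```

Reordering the pipeline amounts to permuting the `classifiers` list, although each classifier would then have to be retrained on the features available at its new position.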
A major problem that we faced was that we found no
satisfactory method to determine the right ordering of individual
classifiers that would guarantee optimal realization
of the grammatical form of the generated expression. We
found that, no matter what ordering we adopted, tasks
that were solved at the beginning had a lower accuracy, as the
necessary contextual information, i.e. the outcomes
of other tasks, was missing. At the same time, subsequent
tasks were influenced by the initial decisions, which in some
cases led to error propagation. Evidently, this was due to
the well-known fact that elements of the linguistic structure
are strongly correlated with one another (see e.g. [Danlos,
1984]). Hence individual generation decisions should not be
handled in isolation and arranging them in a fixed order will
always involve a specific ordering bias.
[Figure 3: Sequential processing as a graph. A lattice with a Start node and one layer per task T1, T2, ..., Tn; the nodes of layer Ti are the labels li1, ..., limi, and transitions carry costs c(lij).]
To get a feeling for the limitations that sequential processing
of generation tasks involves, consider its graphical representation
in Figure 3. The process corresponds to the best-first
traversal of a weighted multi-layered lattice. Separate
layers $T_1, \ldots, T_n$ correspond to the individual tasks, and the
nodes at each layer ($l_{i1}, \ldots, l_{im_i}$) represent class labels for
each task¹. In the sequential model only transitions between
nodes belonging to subsequent layers are allowed. Each such
transition is augmented with a transition cost, which may be
affected by the traversal history but does not consider the future
choices. Nodes selected in this process represent the outcomes
of individual tasks. As can be seen, the process is locally
driven and it does not guarantee an optimal realization
of the tasks.
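The gap between a locally driven traversal and a globally optimal one can be shown on a toy lattice. All labels and costs below are hypothetical and chosen only so that the two strategies disagree; exhaustive enumeration is used as the reference optimum.

```python
from itertools import product

# Hypothetical two-task lattice. Transition costs may depend on the history
# (labels already chosen) but never on future choices.
LABELS = [["a1", "a2"], ["b1", "b2"]]

def cost(layer, label, history):
    if layer == 0:
        return {"a1": 1.0, "a2": 1.5}[label]
    # Second-layer cost depends on the first-layer decision.
    table = {("a1", "b1"): 3.0, ("a1", "b2"): 4.0,
             ("a2", "b1"): 0.5, ("a2", "b2"): 2.0}
    return table[(history[0], label)]

def path_cost(path):
    return sum(cost(i, label, path[:i]) for i, label in enumerate(path))

def best_first(labels):
    """Pick the locally cheapest label at each layer, left to right."""
    path = ()
    for i, layer in enumerate(labels):
        path += (min(layer, key=lambda label: cost(i, label, path)),)
    return path

greedy = best_first(LABELS)                      # commits to a1 too early
optimal = min(product(*LABELS), key=path_cost)   # considers whole paths
```

Here the greedy traversal commits to the cheaper first-layer label and ends up with a total cost of 4.0, while the path through the locally more expensive label costs only 2.0.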
As an example consider three interrelated form elements:
Connective, S Exp. and Verb Form and their different real-
izations presented in Table 2. Apparently each of these FEs
has the potential to affect the overall meaning of the discourse
unit or its stylistics. It can also be seen that only certain combinations
of different forms are allowed in the given semantic
context. A different realization of any of these FEs would require
other elements to be changed accordingly. To conclude,
following Danlos’ observation, we see no a priori reason to
impose any fixed ordering on the respective generation tasks,
and the experiments that we describe in Section 4 support this
position.
Discourse Unit                            FE3      FE4      FE5
Pass the First Union Bank ...             null     vp       bare inf.
It is necessary that you pass ...         null     np+vp    bare inf.
Passing the First Union Bank ...          null     vp       gerund
After passing the First Union Bank ...    after    vp       gerund
After your passing ...                    after    np+vp    gerund
As you pass the First Union Bank ...      as       np+vp    fin. pres.
Until you pass the First Union Bank ...   until    np+vp    fin. pres.
Until passing ...                         until    vp       gerund
Table 2: Different realizations of form elements: Connective,
Verb Form and S Expansion. Rare but correct constructions
are in italics.
¹ Since different generation tasks may have varying numbers of
labels we denote the cardinality of $L_i$, i.e. the set of possible labels
for task $T_i$, as $m_i$.
3 Discrete Optimization Model
As an alternative to sequential ordering of the generation
tasks we consider the metric labeling problem formulated by
[Kleinberg and Tardos, 2000], and originally applied in an
image restoration application, where classifiers determine the
“true” intensity values of individual pixels. This task is formulated
as a labeling function $f : P \to L$ which maps a
set $P$ of $n$ objects onto a set $L$ of $m$ possible labels. The
goal is to find an assignment that minimizes the overall cost
function $Q(f)$ which has two components: assignment costs,
i.e. the costs of selecting a particular label for individual objects,
and separation costs, i.e. the costs of selecting a pair
of labels for two related objects². [Chekuri et al., 2001] proposed
an integer linear programming (ILP) formulation of the
metric labeling problem, with both assignment cost and separation
costs being modeled as binary variables of the linear
cost function.
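For orientation, the metric labeling objective is commonly written in the form below. The symbols $c$, $E$, $w_{pq}$ and $d$ are notation assumed here for illustration and do not appear in the text above: $c(p, l)$ is the cost of assigning label $l$ to object $p$, $E$ is the set of related object pairs, $w_{pq}$ is the strength of the relation between $p$ and $q$, and $d$ is a metric on the label set.

```latex
Q(f) = \sum_{p \in P} c\bigl(p, f(p)\bigr)
     + \sum_{(p,q) \in E} w_{pq} \cdot d\bigl(f(p), f(q)\bigr)
```

The first sum collects the assignment costs and the second the separation costs.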
Recently, [Roth and Yih, 2004] applied an ILP model to
the task of the simultaneous assignment of semantic roles to
the entities mentioned in a sentence and recognition of the
relations holding between them. The assignment costs were
calculated on the basis of predictions of basic classifiers, i.e.
trained for both tasks individually with no access to the out-
comes of the other task. The separation costs were formulated
in terms of binary constraints which specified whether a spe-
cific semantic role could occur in a given relation, or not.
In the remainder of this paper, we present a more general
model, which we apply to the generation tasks presented in
Section 2. We put no limits on the number of tasks being
solved, and express the separation costs as stochastic con-
straints, which can be calculated off-line from the available
linguistic data.
3.1 ILP Formulation
We consider a general context in which the generation process
comprises a range of linguistic decisions modeled as a set of $n$
classification tasks $T = \{T_1, \ldots, T_n\}$ which potentially form
mutually related pairs.
Each task $T_i$ consists in assigning a label from $L_i =
\{l_{i1}, \ldots, l_{im_i}\}$ to an instance that represents the particular
decision. Assignments are modeled as variables of a linear
cost function. We differentiate between simple variables that
model individual assignments of labels and compound variables
that represent respective assignments for each pair of
related tasks.
² These costs were calculated as the function of the metric distance
between a pair of pixels and the difference in intensity.
To represent individual assignments the following procedure
is applied: for each task $T_i$, every label from $L_i$ is associated
with a binary variable $x(l_{ij})$. Each such variable represents
a binary choice, i.e. a respective label $l_{ij}$ is selected if
$x(l_{ij}) = 1$ or rejected otherwise. The coefficient of variable
$x(l_{ij})$ which models the assignment cost $c(l_{ij})$ is given by:
$$c(l_{ij}) = -\log_2(p(l_{ij}))$$
where $p(l_{ij})$ is the probability of $l_{ij}$ being selected as the outcome
of task $T_i$. The probability distribution for each task
is provided by the basic classifiers that do not consider the
outcomes of other tasks³.
The role of compound variables is to provide pairwise constraints
on the outcomes of individual tasks. Since we are
interested in constraining only those tasks that are truly dependent
on one another we first apply the contingency coefficient
$C$ to measure the degree of correlation for each pair of
tasks⁴. In the case of tasks $T_i$ and $T_k$ which are significantly
correlated, for each pair of labels from $L_i \times L_k$ we build a single
variable $x(l_{ij}, l_{kp})$. Each such variable is associated with
a coefficient representing the constraint on the respective pair
of labels $l_{ij}, l_{kp}$ calculated in the following way:
$$c(l_{ij}, l_{kp}) = -\log_2(p(l_{ij}, l_{kp}))$$
with $p(l_{ij}, l_{kp})$ denoting the prior joint probability of labels
$l_{ij}$ and $l_{kp}$ in the data, which is independent from the general
classification context and hence can be calculated off-line⁵.
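Both kinds of cost are negative base-2 logarithms of probabilities, which can be sketched as below. The classifier distribution `p_connective` is hypothetical; the three joint probabilities are the bare-infinitive entries for null, and, until from Table 3. Mapping a zero probability to an infinite cost is an assumption made here, since the text does not say how zero counts are treated.

```python
import math

def cost(probability):
    """Cost -log2(p). Assumption: zero-probability events get infinite cost."""
    return math.inf if probability == 0 else -math.log2(probability)

# Assignment costs from a hypothetical classifier distribution for one task.
p_connective = {"null": 0.5, "and": 0.25, "until": 0.25}
assignment_cost = {label: cost(p) for label, p in p_connective.items()}

# Separation costs from joint probabilities of (connective, verb form) pairs;
# values are the bare-infinitive entries of Table 3.
p_joint = {("null", "bare_inf"): 0.40,
           ("and", "bare_inf"): 0.18,
           ("until", "bare_inf"): 0.0}
separation_cost = {pair: cost(p) for pair, p in p_joint.items()}
```

Frequent label pairs thus receive low costs, while pairs never observed together are effectively ruled out under the assumed zero-handling.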
The ILP model consists of the target function and a set of
constraints which block illegal assignments (e.g. only one label
of the given task can be selected)⁶. In our case the target
function is the cost function $Q(f)$, which we want to minimize:
$$\min Q(f) = \sum_{T_i \in T} \sum_{l_{ij} \in L_i} c(l_{ij}) \cdot x(l_{ij}) + \sum_{T_i, T_k \in T,\; i<k} \; \sum_{\langle l_{ij}, l_{kp} \rangle \in L_i \times L_k} c(l_{ij}, l_{kp}) \cdot x(l_{ij}, l_{kp})$$
Constraints need to be formulated for both the simple and
compound variables. First we want to ensure that exactly one
label $l_{ij}$ belonging to task $T_i$ is selected, i.e. only one simple
variable $x(l_{ij})$ representing labels of a given task can be set
to 1:
$$\sum_{l_{ij} \in L_i} x(l_{ij}) = 1, \quad \forall i \in \{1, \ldots, n\}$$
³ In this case the ordering of tasks is not necessary, and the classifiers
can run independently from each other.
⁴ $C$ is a test for measuring the association of two nominal variables,
and hence adequate for the type of tasks that we consider
here. The coefficient takes values from 0 (no correlation) to 1 (complete
correlation) and is calculated by the formula: $C = (\chi^2 / (N + \chi^2))^{1/2}$,
where $\chi^2$ is the chi-squared statistic and $N$ the total number
of instances. The significance of $C$ is then determined from the
value of $\chi^2$ for the given data. See e.g. [Goodman and Kruskal,
1972].
⁵ In Section 4 we discuss an alternative approach which considers
the actual input.
⁶ For a detailed overview of linear programming and different
types of LP problems see e.g. [Nemhauser and Wolsey, 1999].
[Figure 4: Graph representation of the ILP model. Nodes are the labels lij grouped by task T1, T2, ..., Tn and carry costs c(lij); edges connect labels of correlated tasks and carry costs c(lij, lkp).]
We also require that if two simple variables $x(l_{ij})$ and
$x(l_{kp})$, modeling respectively labels $l_{ij}$ and $l_{kp}$, are set to
1, then the compound variable $x(l_{ij}, l_{kp})$, which models co-occurrence
of these labels, is also set to 1. This is done in
two steps: we first ensure that if $x(l_{ij}) = 1$, then exactly one
variable $x(l_{ij}, l_{kp})$ must also be set to 1:
$$x(l_{ij}) - \sum_{l_{kp} \in L_k} x(l_{ij}, l_{kp}) = 0, \quad \forall i,k \in \{1, \ldots, n\},\; i<k \wedge j \in \{1, \ldots, m_i\}$$
and do the same for variable $x(l_{kp})$:
$$x(l_{kp}) - \sum_{l_{ij} \in L_i} x(l_{ij}, l_{kp}) = 0, \quad \forall i,k \in \{1, \ldots, n\},\; i<k \wedge p \in \{1, \ldots, m_k\}$$
Finally, we constrain the values of both simple and compound
variables to be binary:
$$x(l_{ij}) \in \{0,1\} \wedge x(l_{ij}, l_{kp}) \in \{0,1\}, \quad \forall i,k \in \{1, \ldots, n\} \wedge j \in \{1, \ldots, m_i\} \wedge p \in \{1, \ldots, m_k\}$$
We can represent the decision process that our ILP model
involves as a graph, with the nodes corresponding to individual
labels and the edges marking the associations between
labels belonging to correlated tasks. In Figure 4, task $T_1$ is
correlated with task $T_2$ and task $T_2$ with task $T_n$. No correlation
exists for pair $T_1, T_n$. Both nodes and edges are augmented
with costs. The goal is to select a subset of connected
nodes, minimizing the overall cost, given that for each group
of nodes $T_1, T_2, \ldots, T_n$ exactly one node must be selected,
and the selected nodes, representing correlated tasks, must
be connected. We can see that in contrast to the pipeline approach
(cf. Figure 3), no local decisions determine the overall
assignment as the global distribution of costs is considered.
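The effect of the joint objective can be illustrated on a toy instance. The sketch below is not an ILP solver: it replaces branch-and-bound with brute-force enumeration of all labelings, which is feasible only for very small label sets. The function `solve_by_enumeration` and all cost values are hypothetical.

```python
from itertools import product

def solve_by_enumeration(assign_cost, pair_cost):
    """Minimize Q(f) by trying every joint labeling (toy sizes only).

    assign_cost: {task: {label: c(l_ij)}}
    pair_cost:   {(task_i, task_k): {(label_i, label_k): c(l_ij, l_kp)}},
                 given for correlated task pairs only
    """
    tasks = sorted(assign_cost)
    best, best_q = None, float("inf")
    # Choosing exactly one label per task mirrors the constraint
    # sum_j x(l_ij) = 1; the compound variable of a correlated pair is
    # then implied by the two chosen labels.
    for labels in product(*(assign_cost[t] for t in tasks)):
        f = dict(zip(tasks, labels))
        q = sum(assign_cost[t][f[t]] for t in tasks)
        q += sum(costs[(f[ti], f[tk])] for (ti, tk), costs in pair_cost.items())
        if q < best_q:
            best, best_q = f, q
    return best, best_q

# Hypothetical costs for two correlated tasks.
assign_cost = {
    "connective": {"null": 0.4, "until": 1.6},
    "verb_form": {"gerund": 0.9, "fin_pres": 1.1},
}
pair_cost = {("connective", "verb_form"): {
    ("null", "gerund"): 3.0, ("null", "fin_pres"): 2.5,
    ("until", "gerund"): 2.0, ("until", "fin_pres"): 0.3,
}}
labeling, q_value = solve_by_enumeration(assign_cost, pair_cost)
```

In this example each task taken alone would prefer `null` and `gerund`, but the low separation cost of the pair (`until`, `fin_pres`) makes that combination the global minimum.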
4 Experiments and Results
In order to evaluate our approach we conducted a series of
experiments with two implementations of the ILP model and
two different pipelines. Each system takes as input the tree-
based representation of the semantic content of route direc-
tions described in Section 2. The generation process traverses
the temporal tree in a depth-first fashion, and for each node a
single discourse unit is realized.
[Figure 5: Correlation network for the generation tasks. Nodes: T1 Disc. Units Rank, T2 Disc. Units Dir., T3 Connective, T4 S Exp., T5 Verb Form, T6 Verb Lex, T7 Phrase Type, T8 Prep. Lex, T9 Phrase Rank.]
                 T3 Connective
T5 Verb Form     null     and      as       after    until
bare inf         0.40     0.18     0        0        0
gerund           0        0        0        0.04     0.01
fin pres         0.05     0.01     0.06     0.03     0.06
will inf         0.06     0.05     0        0        0
Table 3: Joint distribution matrix for selected labels of tasks
Connective (horizontal) and Verb Form (vertical), computed
for all discourse units in a corpus.
4.1 Correlations Between Tasks
We started by running the correlation tests for all pairs of
tasks. The obtained correlation network is presented in Fig-
ure 5. It is interesting to observe that tasks which realize FEs
belonging to the same levels of linguistic organization, and
have traditionally been handled within the same generation
stages (i.e. Text Planning, Microplanning and Realization) are
closely correlated with one another. This fact supports em-
pirically some assumptions behind Reiter’s consensus model.
On the other hand, there exist quite a few correlations that
extend over the stage boundaries, and all three lexicalization
tasks, i.e. T3, T6 and T8, are correlated with many tasks of a
totally different linguistic character.
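A correlation test of this kind can be sketched with the contingency coefficient $C = (\chi^2 / (N + \chi^2))^{1/2}$. The function below computes $\chi^2$ and $C$ from a table of co-occurrence counts; the counts are hypothetical, and the significance decision (comparing $\chi^2$ with a critical value for the table's degrees of freedom) is left out.

```python
import math

def contingency_coefficient(table):
    """Contingency coefficient C = sqrt(chi2 / (N + chi2)).

    table: list of rows of observed co-occurrence counts for two nominal
           variables (e.g. the labels of two generation tasks)
    Returns (C, chi2).
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:   # skip empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n + chi2)), chi2

# Hypothetical counts: rows = labels of one task, columns = labels of another.
dependent = [[40, 0],
             [0, 40]]
independent = [[10, 10],
               [10, 10]]
c_dep, chi2_dep = contingency_coefficient(dependent)
c_ind, chi2_ind = contingency_coefficient(independent)
```

Task pairs whose $\chi^2$ value is significant would then be linked in the correlation network and receive compound variables in the ILP model.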
4.2 ILP Systems
We used the ILP model described in Section 3 to implement
two generation systems. To obtain assignment costs, both
systems get a probability distribution for each task from ba-
sic classifiers trained on the training data. To calculate the
separation costs, modeling the stochastic constraints on the
co-occurrence of labels, we considered correlated tasks only
(cf. Figure 5) and applied two calculation methods, which re-
sulted in two different system implementations.
In ILP1, for each pair of tasks we computed the joint distri-
bution of the respective labels considering all discourse units
in the training data before the actual input was known. The
joint distributions obtained in this way were used for generating all discourse
units from the test data. An example matrix with joint
distribution for selected labels of tasks Connective and Verb
Form is given in Table 3. An advantage of this approach is
that the computation can be done in an offline mode and has
no impact on the run-time.
In ILP2, the joint distribution for a pair of tasks was calculated
at run-time, i.e. only after the actual input had been
known. This time we did not consider all discourse units in
the training data, but only those whose meaning, represented
as a feature vector, was similar to the meaning of the input
discourse unit. As a similarity metric we used the Phi coefficient⁷,
and set the similarity threshold at 0.8. As can be
seen from Table 4, the probability distribution computed in
this way is better suited to the specific semantic context. This
is especially important if the available corpus is small and the
frequency of certain pairs of labels might be too low to have
a significant impact on the final assignment.
                 T3 Connective
T5 Verb Form     null     and      as       after    until
bare inf         0.13     0.02     0        0        0
gerund           0        0        0        0        0
fin pres         0        0        0.05     0.02     0.27
will inf         0.36     0.13     0        0        0
Table 4: Joint distribution matrix for tasks Connective and
Verb Form, considering only disc. units similar to (c): until
you see the river side in front of you, at Phi-threshold ≥ 0.8.
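The similarity filter can be sketched as dummy coding followed by the Phi coefficient for two binary vectors. The feature names and value inventories below are hypothetical, and the Phi formula is the standard one for a 2x2 table of agreement counts; only the 0.8 threshold is taken from the text.

```python
import math

def dummy_code(vector, value_inventory):
    """One binary indicator per (feature, value) pair."""
    return [1 if vector[feature] == value else 0
            for feature in sorted(value_inventory)
            for value in value_inventory[feature]]

def phi(x, y):
    """Phi coefficient between two equal-length binary vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical semantic features and their possible values.
inventory = {"aspect": ["action", "state", "event"],
             "relation": ["initial", "subsequent", "ongoing"]}
unit_a = {"aspect": "event", "relation": "ongoing"}
unit_b = {"aspect": "event", "relation": "ongoing"}
unit_c = {"aspect": "action", "relation": "subsequent"}

same = phi(dummy_code(unit_a, inventory), dummy_code(unit_b, inventory))
different = phi(dummy_code(unit_a, inventory), dummy_code(unit_c, inventory))
similar_enough = same >= 0.8   # threshold from the text
```

Only training units that pass the threshold would contribute to the joint distribution used for the current input.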
4.3 Pipeline Systems
As a baseline we implemented two pipeline systems. In the
first one we used the ordering of tasks that resembles most
closely the standard NLG pipeline and which we also used
before in [Marciniak and Strube, 2004]⁸.
Individual classifiers had access to both the semantic fea-
tures, and the features output by the previous modules. To
train the classifiers, the correct feature values were extracted
from the training data and during testing the generated, and
hence possibly erroneous, values were taken.
In the other pipeline system we wanted to minimize the
error-propagation effect and placed the tasks in the order of
decreasing accuracy. To determine the ordering of tasks we
applied the following procedure: the classifier with the high-
est baseline accuracy was selected as the first one. The re-
maining classifiers were trained and tested again, but this time
they had access to the additional feature. Again, the classifier
with the highest accuracy was selected and the procedure was
repeated until all classifiers were ordered.
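The ordering procedure is a greedy selection loop, which can be sketched as follows. The `evaluate` callable is hypothetical: it stands in for retraining and testing a task's classifier with the outputs of the already ordered tasks as additional features. The accuracies in the toy example are invented.

```python
def order_by_accuracy(tasks, evaluate):
    """Greedy ordering: repeatedly pick the most accurate remaining task.

    tasks:    iterable of task names
    evaluate: hypothetical callable (task, earlier_tasks) -> accuracy of
              the task's classifier when it can use the outputs of
              earlier_tasks as extra features
    """
    remaining = list(tasks)
    ordered = []
    while remaining:
        best = max(remaining, key=lambda t: evaluate(t, tuple(ordered)))
        ordered.append(best)
        remaining.remove(best)
    return ordered

# Toy accuracies: verb_form improves once connective is available.
def toy_evaluate(task, earlier):
    base = {"connective": 0.79, "verb_form": 0.70, "s_exp": 0.75}[task]
    if task == "verb_form" and "connective" in earlier:
        base += 0.10
    return base

pipeline = order_by_accuracy(["s_exp", "verb_form", "connective"], toy_evaluate)
```

The loop needs on the order of n²/2 train-and-test rounds for n tasks, since every remaining classifier is re-evaluated after each selection.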
4.4 Evaluation
We evaluated our system using leave-one-out cross-
validation, i.e. for all texts in the corpus, each text was used
once for testing, and the remaining texts provided the training
data. To solve individual classification tasks we used the de-
cision tree learner C4.5 in the pipeline systems and the Naive
Bayes algorithm9 in the ILP systems. Both learning schemes
yielded highest results in the respective configurations10. To
solve the ILP models we used lp_solve, a highly efficient
GNU-licence Mixed Integer Programming (MIP) solver11,
that implements the Branch-and-Bound algorithm. For each
task we applied a feature selection procedure (cf. [Kohavi and
John, 1997]) to determine which semantic features should be
taken as the input by the basic classifiers.
7 Phi is a measure of the extent of correlation between two sets
of binary variables, see e.g. [Edwards, 1976]. To represent
multi-class features on a binary scale we applied dummy coding which
transforms multi-class nominal variables to a set of dummy variables
with binary values.
8 The ordering of tasks is given in Table 5.
9 Both implemented in the Weka machine learning software
[Witten and Frank, 2000].
              Pipeline 1               Pipeline 2               ILP 1              ILP 2
Tasks         Pos.  Accuracy  κ        Pos.  Accuracy  κ        Accuracy  κ        Accuracy  κ
Dis.Un. Rank  1     96.81%    90.90%   2     96.81%    90.90%   97.43%    92.66%   97.43%    92.66%
Dis.Un. Pos.  2     98.04%    89.64%   1     98.04%    89.64%   96.10%    77.19%   97.95%    89.05%
Connective    3     78.64%    60.33%   8     79.10%    61.14%   79.15%    61.22%   79.36%    61.31%
S Exp.        4     95.90%    89.45%   3     96.20%    90.17%   99.48%    98.65%   99.49%    98.65%
Verb Form     5     86.76%    77.01%   4     87.83%    78.90%   92.81%    87.60%   93.22%    88.30%
Verb Lex.     6     64.58%    60.87%   9     67.40%    64.19%   75.87%    73.69%   76.08%    74.00%
Phr. Type     7     86.93%    75.07%   5     87.08%    75.36%   87.33%    76.75%   88.03%    77.17%
Prep. Lex.    8     86.23%    81.12%   6     86.03%    81.10%   87.28%    82.20%   88.59%    83.24%
Phr. Rank     9     84.73%    75.24%   7     86.95%    78.65%   90.22%    84.02%   91.27%    85.72%
Phi           0.85                     0.87                     0.89               0.90
Table 5: Results reached by the implemented ILP systems and two baselines. For both pipeline systems, Pos. stands for the
position of the tasks in the pipeline.
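The leave-one-out protocol over texts used in this section can be sketched in a few lines of Python. The sketch is illustrative only: `train` and `score` are hypothetical callables standing in for classifier training and per-text evaluation, and the text identifiers are placeholders.

```python
def leave_one_out(texts):
    """Yield (training_texts, held_out_text) pairs, one fold per text."""
    for i, held_out in enumerate(texts):
        yield texts[:i] + texts[i + 1:], held_out


def cross_validate(texts, train, score):
    """Average a per-text score over all leave-one-out folds.

    train(training_texts) -> model and score(model, text) -> float are
    hypothetical callables supplied by the caller.
    """
    scores = [score(train(training), held_out)
              for training, held_out in leave_one_out(texts)]
    return sum(scores) / len(scores)


# Placeholder corpus of three texts; each is held out exactly once.
FOLDS = list(leave_one_out(["text1", "text2", "text3"]))
```

Splitting at the level of whole texts, rather than individual discourse units, keeps all units of the test text out of the training data.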
To evaluate individual tasks we applied two metrics: accu-
racy, calculated as the proportion of correct classifications to
the total number of instances, and the κ statistic, which cor-
rects for the proportion of classifications that might occur by
chance. For end-to-end evaluation, we applied the Phi coef-
ficient to measure the degree of similarity between the vector
representations of the generated form (i.e. built from the out-
comes of individual tasks) and the reference form obtained
from the test data. The Phi-based similarity metric is simi-
lar to κ as it compensates for the fact that a match between
two multi-label features is more difficult to obtain than in the
case of binary features. This measure tells us how well all the
tasks have been solved together, which in our case amounts
to generating the whole text.
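The two per-task metrics can be made concrete with a small Python sketch. It assumes Cohen's formulation of κ, in which chance agreement is estimated from the marginal label frequencies of the reference and the predicted labels; the function names and the toy label sequences are hypothetical.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Proportion of instances whose predicted label matches the reference."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def kappa(gold, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    observed = accuracy(gold, predicted)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Chance agreement from the two marginal label distributions.
    expected = sum(gold_counts[label] * pred_counts[label]
                   for label in gold_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; this sketch returns 0.0 by choice
    return (observed - expected) / (1.0 - expected)


# Toy reference and predicted labels for a binary task.
GOLD = ["a", "a", "b", "b"]
PREDICTED = ["a", "a", "b", "a"]
```

In this toy case the accuracy is 0.75, but half of the agreement is expected by chance, so κ is 0.5. This is the same effect that makes the κ values in Table 5 lower than the corresponding accuracies.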
The results presented in Table 5 show that the ILP systems
achieved highest accuracy and κ for most tasks and reached
the highest overall Phi score. Notice that ILP2 improved on the
accuracy of both pipeline systems for the three correlated
tasks that we discussed before, i.e. Connective, S Exp. and
Verb Form. Another group of correlated tasks for which the
results appear interesting is Verb Lex., Phrase Type and
Phrase Rank (cf. Figure 3). Notice that Verb Lex. got higher
scores in Pipeline2, with outputs from both Phrase Type and
Phrase Rank (see the respective pipeline positions), but the re-
verse effect did not occur: scores for both phrase tasks were
lower in Pipeline1 when they had access to the output from
Verb Lex., contrary to what we might expect. Apparently, this
was due to the low accuracy for Verb Lex., which caused the
already mentioned error propagation. This example shows
well the advantage that optimization processing brings: both
ILP systems reached much higher scores for all three tasks.
Finally, it appears to be no coincidence that the three tasks
involving lexical choice, i.e. Connective, Verb Lex. and
Preposition Lex., scored lower than the syntactic tasks in all
systems. This can be attributed partially to the limitations of
retrieval measures, which do not allow for the fact that for a
given semantic content more than one lexical form can be
appropriate.
10 We have found that in direct comparison C4.5 performs better
than Naive Bayes, but the probability distribution that it outputs is
strongly biased towards the winning label. In this case it is
practically impossible for the ILP system to change the classifier’s
decision, as the costs of other labels get extremely high. Hence the more
balanced probability distribution given by Naive Bayes can be
corrected more easily in the optimization process.
11 http://www.geocities.com/lpsolve/
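The remark in footnote 10 about strongly peaked probability distributions can be illustrated numerically. The Python sketch below assumes, for illustration only, that the cost of assigning a label is the negative base-2 logarithm of its classifier probability; this cost function and the probability values are assumptions of the example and are not taken from this section. Only the Verb Form labels are borrowed from Table 4.

```python
import math


def costs(distribution):
    """Assumed assignment cost: negative log2 of the label probability."""
    return {label: -math.log2(p) for label, p in distribution.items()}


def switch_penalty(distribution):
    """Extra cost of choosing the runner-up label instead of the winner."""
    ranked = sorted(costs(distribution).values())
    return ranked[1] - ranked[0]


# Invented distributions: a peaked one (decision-tree-like) and a more
# balanced one (Naive-Bayes-like) over three Verb Form labels.
PEAKED = {"bare inf": 0.98, "gerund": 0.01, "fin pres": 0.01}
BALANCED = {"bare inf": 0.5, "gerund": 0.3, "fin pres": 0.2}
```

Under this assumption, moving away from the winning label costs roughly 6.6 extra units for the peaked distribution but less than one unit for the balanced one, so only in the latter case can costs arising from correlated tasks plausibly outweigh the classifier's first choice.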
5 Conclusions
In this paper we showed that the pipeline architecture in an
NLG application can be successfully replaced with an inte-
grated ILP-based model which is better suited to handling
correlated generation decisions. To the best of our knowledge,
linear programming has been used in NLG-related
work only by [Althaus et al., 2004], to solve the single task of
determining the order of discourse constituents. In a somewhat
related context, [Dras, 1999] used ILP to optimize the
task of text paraphrasing, given global constraints such as text
and sentence length, readability, etc.
In contrast, in this work we use an ILP model to orga-
nize the entire process of generating the surface form from
an underlying semantic representation, which involves an
integration of different types of NLG tasks. Although in
our system we use machine learning as the primary deci-
sion making mechanism, we believe that the ILP model can
also be used with knowledge-based systems that observe the
classification-oriented formulation of the NLG tasks.
Finally, we are convinced that an adequate evaluation of an
NLG system must at some stage go beyond the application of
quantitative measures. Nevertheless, it is reasonable to expect
that the improvement that we reached with the ILP system,
especially the increase of the overall Phi score, correlates
to some extent with an improvement in output quality. To verify
this, we are currently conducting a qualitative evaluation of the
output from our system.
Acknowledgements: The work presented here has been
funded by the Klaus Tschira Foundation, Heidelberg, Ger-
many. The first author receives a scholarship from KTF
(09.001.2004).

References
[Althaus et al., 2004] Ernst Althaus, Nikiforos Karamanis, and
Alexander Koller. Computing locally coherent discourses. In
Proceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics, Barcelona, Spain, July 21-26, 2004,
pages 399–406, 2004.
[Appelt, 1985] Douglas Appelt. Planning English Sentences. Cam-
bridge University Press, Cambridge, UK, 1985.
[Cahill and Reape, 1999] Lynne Cahill and Mike Reape. Compo-
nent tasks in applied NLG systems. Technical Report ITRI-99-
05, ITRI, University of Brighton, March 1999.
[Chekuri et al., 2001] Chandra Chekuri, Sanjeev Khanna, Joseph
Naor, and Leonid Zosin. Approximation algorithms for the met-
ric labeling problem via a new linear programming formulation.
In Proceedings of the 12th Annual ACM SIAM Symposium on
Discrete Algorithms, Washington, DC, pages 109–118, 2001.
[Cheng et al., 2001] Hua Cheng, Massimo Poesio, Renate Hen-
schel, and Chris Mellish. Corpus-based NP modifier generation.
In Proceedings of the 2nd Meeting of the North American Chap-
ter of the Association for Computational Linguistics, Pittsburgh,
PA, 2-7 June, 2001, pages 9–16, 2001.
[Daelemans and van den Bosch, 1998] Walter Daelemans and An-
tal van den Bosch. Rapid development of NLP modules with
memory-based learning. In Proceedings of ELSNET in Wonder-
land. Utrecht: ELSNET, pages 105–113, 1998.
[Daelemans, 1993] Walter Daelemans. Memory-based lexical ac-
quisition and processing. In Proceedings of the Third Interna-
tional EAMT Workshop on Machine Translation and the Lexicon,
Heidelberg, Germany, 26-28 April, 1993, pages 85–98, 1993.
[Danlos, 1984] Laurence Danlos. Conceptual and linguistic deci-
sions in generation. In Proceedings of the 10th International
Conference on Computational Linguistics, Stanford, Cal., pages
501–504, 1984.
[De Smedt et al., 1996] Koenraad De Smedt, Helmut Horacek, and
Michael Zock. Architectures for natural language generation:
Problems and perspectives. In G. Adorni and M. Zock, editors,
Trends in Natural Language Generation: An Artificial Intelli-
gence Perspective, pages 17–46. Springer Verlag, 1996.
[Dimitromanolaki and Androutsopoulos, 2003] Aggeliki Dimitro-
manolaki and Ion Androutsopoulos. Learning to order facts for
discourse planning in natural language generation. In Proc. of
the 9th European Workshop on Natural Language Generation,
Budapest, Hungary, 13 – 14 April 2003, pages 23–30, 2003.
[Dras, 1999] Mark Dras. Tree Adjoining Grammar and the Reluc-
tant Paraphrasing of Text. PhD thesis, Macquarie University,
Australia, 1999.
[Duboue, 2004] Pablo A. Duboue. Indirect supervised learning
of content selection logic. In Proceedings of the 3rd Interna-
tional Conference on Natural Language Generation, Brocken-
hurst, UK, 14-16 July, 2004, pages 41–50, 2004.
[Edwards, 1976] Allen L. Edwards. An Introduction to Linear
Regression and Correlation. W. H. Freeman, San Francisco, Cal.,
1976.
[Goodman and Kruskal, 1972] Leo A. Goodman and W. H.
Kruskal. Measures of association for cross-classification, iv.
Journal of the American Statistical Association, 67:415–421,
1972.
[Joshi and Schabes, 1991] Aravind K. Joshi and Yves Schabes.
Tree-adjoining grammars and lexicalized grammars. In Maurice
Nivat and Andreas Podelski, editors, Definability and Recogniz-
ability of Sets of Trees. Elsevier, 1991.
[Kleinberg and Tardos, 2000] Jon M. Kleinberg and Eva Tardos.
Approximation algorithms for classification problems with pair-
wise relationships: Metric labeling and Markov random fields.
Journal of the ACM, 49(5):616–639, 2000.
[Kohavi and John, 1997] Ron Kohavi and George H. John. Wrap-
pers for feature subset selection. Artificial Intelligence Journal,
97:273–324, 1997.
[Marciniak and Strube, 2004] Tomasz Marciniak and Michael
Strube. Classification-based generation using TAG. In Pro-
ceedings of the 3rd International Conference on Natural
Language Generation, Brockenhurst, UK, 14-16 July, 2004,
pages 100–109, 2004.
[Marciniak and Strube, 2005] Tomasz Marciniak and Michael
Strube. Modeling and annotating the semantics of route di-
rections. In Proceedings of the 6th International Workshop on
Computational Semantics, Tilburg, The Netherlands, January
12-14, 2005, pages 151–162, 2005.
[Mellish and Evans, 2004] Chris Mellish and Roger Evans. Imple-
mentation architectures for natural language generation. Natural
Language Engineering, 10(3/4):261–282, 2004.
[Mellish et al., 2004] Chris Mellish, Mike Reape, Donia Scott,
Lynne Cahill, Roger Evans, and Daniel Paiva. A reference archi-
tecture for generation systems. Natural Language Engineering,
10(3/4):227–260, 2004.
[Meteer, 1990] Marie W. Meteer. Abstract linguistic resources for
text planning. In Proceedings of the 5th International Workshop
on Natural Language Generation, Pittsburgh, PA, 3-6 June 1990,
pages 62–69, 1990.
[Nemhauser and Wolsey, 1999] George L. Nemhauser and Lau-
rence A. Wolsey. Integer and combinatorial optimization. Wiley,
New York, NY, 1999.
[Reiter and Dale, 2000] Ehud Reiter and Robert Dale. Building
Natural Language Generation Systems. Cambridge University
Press, Cambridge, UK, 2000.
[Reiter and Sripada, 2004] Ehud Reiter and Somayajulu Sripada.
Contextual influences on near-synonym choice. In Proceedings
of the 3rd International Conference on Natural Language Gen-
eration, Brockenhurst, UK, 14-16 July, 2004, pages 161–170,
2004.
[Reiter, 1994] Ehud Reiter. Has a consensus NL generation archi-
tecture appeared, and is it psycholinguistically plausible? In
Proc. of the 7th International Workshop on Natural Language
Generation, Kennebunkport, ME, 21-24 June 1994, pages 160–
173, 1994.
[Roth and Yih, 2004] Dan Roth and Wen-tau Yih. A linear pro-
gramming formulation for global inference in natural language
tasks. In Proceedings of the 8th Conference on Computational
Natural Language Learning, Boston, Mass., May 2-7, 2004,
pages 1–8, 2004.
[Webber and Joshi, 1998] Bonnie Lynn Webber and Aravind Joshi.
Anchoring a lexicalized tree-adjoining grammar for discourse. In
Proceedings of the COLING/ACL ’98 Workshop on Discourse
Relations and Discourse Markers, Montréal, Québec, Canada,
15 August 1998, pages 86–92, 1998.
[Witten and Frank, 2000] Ian H. Witten and Eibe Frank. Data Min-
ing - Practical Machine Learning Tools and Techniques with Java
Implementations. Morgan Kaufmann, San Francisco, Cal., 2000.
