Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL),
pages 136–143, Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
Beyond the Pipeline: Discrete Optimization in NLP
Tomasz Marciniak and Michael Strube
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg, Germany
http://www.eml-research.de/nlp
Abstract
We present a discrete optimization model based on
a linear programming formulation as an alternative
to the cascade of classifiers implemented in many
language processing systems. Since NLP tasks are
correlated with one another, sequential processing
does not guarantee optimal solutions. We apply our
model in an NLG application and show that it per-
forms better than a pipeline-based system.
1 Introduction
NLP applications involve mappings between com-
plex representations. In generation a representa-
tion of the semantic content is mapped onto the
grammatical form of an expression, and in analy-
sis the semantic representation is derived from the
linear structure of a text or utterance. Each such
mapping is typically split into a number of differ-
ent tasks handled by separate modules. As noted
by Daelemans & van den Bosch (1998), individual
decisions that these tasks involve can be formulated
as classification problems falling in either of
two groups: disambiguation or segmentation. The
use of machine-learning to solve such tasks facil-
itates building complex applications out of many
light components. The architecture of choice for
such systems has become a pipeline, with strict or-
dering of the processing stages. An example of
a generic pipeline architecture is GATE (Cunning-
ham et al., 1997) which provides an infrastructure
for building NLP applications. Sequential process-
ing has also been used in several NLG systems (e.g.
Reiter (1994), Reiter & Dale (2000)), and has been
successfully used to combine standard preprocess-
ing tasks such as part-of-speech tagging, chunking
and named entity recognition (e.g. Buchholz et al.
(1999), Soon et al. (2001)).
In this paper we address the problem of aggregating
the outputs of classifiers solving different NLP
tasks. We compare pipeline-based processing with
discrete optimization modeling used in the field of
computer vision and image recognition (Kleinberg
& Tardos, 2000; Chekuri et al., 2001) and recently
applied in NLP by Roth & Yih (2004), Punyakanok
et al. (2004) and Althaus et al. (2004). Whereas
Roth and Yih used optimization to solve two tasks
only, and Punyakanok et al. and Althaus et al. fo-
cused on a single task, we propose a general for-
mulation capable of combining a large number of
different NLP tasks. We apply the proposed model
to solving numerous tasks in the generation process
and compare it with two pipeline-based systems.
The paper is structured as follows: in Section 2 we
discuss the use of classifiers for handling NLP tasks
and point to the limitations of pipeline processing.
In Section 3 we present a general discrete optimiza-
tion model whose application in NLG is described
in Section 4. Finally, in Section 5 we report on the
experiments and evaluation of our approach.
2 Solving NLP Tasks with Classifiers
Classification can be defined as the task $T_i$ of assigning
one of a discrete set of $m_i$ possible labels
$L_i = \{l_{i1}, \ldots, l_{im_i}\}$¹ to an unknown instance. Since
generic machine-learning algorithms can be applied
to solving single-valued predictions only, complex
structures, such as parse trees, coreference chains or
sentence plans, can only be assembled from the outputs
of many different classifiers.

Figure 1: Sequential processing as a graph. [A Start node is followed by
layers $T_1, \ldots, T_n$; the nodes of layer $T_i$ are the labels
$l_{i1}, \ldots, l_{im_i}$, and the transitions into them carry the
probabilities $p(l_{ij})$.]

¹ Since we consider different NLP tasks with varying numbers
of labels we denote the cardinality of $L_i$, i.e. the set of
possible labels for task $T_i$, as $m_i$.
In an application implemented as a cascade of
classifiers the output representation is built incrementally,
with subsequent classifiers having access
to the outputs of previous modules. An important
characteristic of this model is its extensibility: it
is generally easy to change the ordering or insert
new modules at any place in the pipeline². A major
problem with sequential processing of linguistic
data stems from the fact that elements of linguistic
structure, at the semantic or syntactic levels, are
strongly correlated with one another. Hence classifiers
that have access to additional contextual information
perform better than if this information is
withheld. In most cases, though, if task $T_k$ can use
the output of $T_i$ to increase its accuracy, the reverse
is also true. In practice this type of processing may
lead to error propagation. If due to the scarcity of
contextual information the accuracy of initial classifiers
is low, erroneous values passed as input to
subsequent tasks can cause further misclassifications
which can distort the final outcome (also discussed
by Roth and Yih and van den Bosch et al. (1998)).
² Both operations only require retraining classifiers with a
new selection of the input features.

As can be seen in Figure 1, solving classification
tasks sequentially corresponds to the best-first
traversal of a weighted multi-layered lattice. Nodes
at separate layers ($T_1, \ldots, T_n$) represent labels of different
classification tasks and transitions between
the nodes are augmented with probabilities of selecting
respective labels at the next layer. In the sequential
model only transitions between nodes belonging
to subsequent layers are allowed. At each
step, the transition with the highest local probability
is selected. Selected nodes correspond to outcomes
of individual classifiers. This graphical representation
shows that sequential processing does not guarantee
an optimal context-dependent assignment of
class labels and favors tasks that occur later, by providing
them with contextual information, over those
that are solved first.
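The difference between the best-first traversal and a globally optimal path can be illustrated with a minimal sketch. The two-layer lattice below, its label names and all probabilities are invented for illustration and are not taken from our data; the function names are likewise hypothetical.

```python
from itertools import product

# Hypothetical two-task lattice. start[a] is p(a) for a label of T1;
# trans[(a, b)] is the probability of label b at T2 given label a at T1.
start = {"a1": 0.6, "a2": 0.4}
trans = {
    ("a1", "b1"): 0.5, ("a1", "b2"): 0.5,
    ("a2", "b1"): 0.9, ("a2", "b2"): 0.1,
}
labels_t2 = ["b1", "b2"]

def greedy_path():
    """Pipeline: commit to the locally best label at every layer."""
    first = max(start, key=start.get)
    second = max(labels_t2, key=lambda b: trans[(first, b)])
    return (first, second), start[first] * trans[(first, second)]

def best_path():
    """Exhaustive search over all paths through the lattice."""
    scored = {(a, b): start[a] * trans[(a, b)]
              for a, b in product(start, labels_t2)}
    best = max(scored, key=scored.get)
    return best, scored[best]
```

In this toy setting the greedy traversal commits to `a1` at the first layer and ends with a path probability of 0.3, whereas the path through `a2` has probability 0.36. The early local decision cannot be revised once later context becomes available.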
3 Discrete Optimization Model
As an alternative to sequential ordering of NLP
tasks we consider the metric labeling problem formulated
by Kleinberg & Tardos (2000), and originally
applied in an image restoration application,
where classifiers determine the "true" intensity values
of individual pixels. This task is formulated as a
labeling function $f : P \rightarrow L$, that maps a set $P$ of $n$
objects onto a set $L$ of $m$ possible labels. The goal
is to find an assignment that minimizes the overall
cost function $Q(f)$, that has two components: assignment
costs, i.e. the costs of selecting a particular
label for individual objects, and separation costs, i.e.
the costs of selecting a pair of labels for two related
objects³. Chekuri et al. (2001) proposed an integer
linear programming (ILP) formulation of the metric
labeling problem, with both assignment and
separation costs being modeled as binary variables
of the linear cost function.
Recently, Roth & Yih (2004) applied an ILP
model to the task of the simultaneous assignment
of semantic roles to the entities mentioned in a sentence
and recognition of the relations holding between
them. The assignment costs were calculated
on the basis of predictions of basic classifiers, i.e.
trained for both tasks individually with no access to
the outcomes of the other task. The separation costs
were formulated in terms of binary constraints, that
specified whether a specific semantic role could occur
in a given relation, or not.
In the remainder of this paper, we present a more
general model, that is arguably better suited to handling
different NLP problems. More specifically, we
put no limits on the number of tasks being solved,
and express the separation costs as stochastic constraints,
which for almost any NLP task can be calculated
off-line from the available linguistic data.

³ These costs were calculated as the function of the metric
distance between a pair of pixels and the difference in intensity.
3.1 ILP Formulation
We consider a general context in which a specific
NLP problem consists of individual linguistic decisions
modeled as a set of $n$ classification tasks
$T = \{T_1, \ldots, T_n\}$, that potentially form mutually
related pairs. Each task $T_i$ consists in assigning a
label from $L_i = \{l_{i1}, \ldots, l_{im_i}\}$ to an instance that
represents the particular decision. Assignments are
modeled as variables of a linear cost function. We
differentiate between simple variables that model individual
assignments of labels and compound variables
that represent respective assignments for each
pair of related tasks.
To represent individual assignments the following
procedure is applied: for each task $T_i$, every label
from $L_i$ is associated with a binary variable $x(l_{ij})$.
Each such variable represents a binary choice, i.e. a
respective label $l_{ij}$ is selected if $x(l_{ij}) = 1$ or rejected
otherwise. The coefficient of variable $x(l_{ij})$,
that models the assignment cost $c(l_{ij})$, is given by:

\[ c(l_{ij}) = -\log_2 p(l_{ij}) \]

where $p(l_{ij})$ is the probability of $l_{ij}$ being selected as
the outcome of task $T_i$. The probability distribution
for each task is provided by the basic classifiers that
do not consider the outcomes of other tasks⁴.
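As a small illustration of the assignment costs, the following sketch maps a label distribution to negative base-2 log probabilities. The function name, the label names and the probabilities are hypothetical, and the treatment of zero probabilities (an infinite cost) is an assumption of this sketch, since the text does not specify it.

```python
import math

def assignment_costs(distribution):
    """Map a classifier's label distribution {label: p} to costs -log2(p).

    Labels with zero probability get an infinite cost; this is an
    assumption of the sketch, not something stated in the text.
    """
    return {
        label: (-math.log2(p) if p > 0 else math.inf)
        for label, p in distribution.items()
    }

# Invented distribution over three Verb Form labels.
costs = assignment_costs({"gerund": 0.5, "bare_inf": 0.25, "fin_pres": 0.25})
```

A label with probability 0.5 receives cost 1 and a label with probability 0.25 receives cost 2, so less probable labels are more expensive to select.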
The role of compound variables is to provide pairwise
constraints on the outcomes of individual tasks.
Since we are interested in constraining only those
tasks that are truly dependent on one another we first
apply the contingency coefficient $C$ to measure the
degree of correlation for each pair of tasks⁵.
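The contingency coefficient can be computed directly from a table of label co-occurrence counts, using $C = \sqrt{\chi^2 / (N + \chi^2)}$. The sketch below is a plain-Python illustration under that formula; the function name and the example counts are hypothetical, and the significance test itself (comparing $\chi^2$ against a critical value) is not included.

```python
import math

def contingency_coefficient(table):
    """Contingency coefficient C for a table of co-occurrence counts.

    table[r][c] is the number of instances carrying label r of one task
    and label c of the other. Returns the pair (C, chi2).
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n + chi2)), chi2

# Invented counts: labels of the two tasks always co-occur on the diagonal.
c_value, chi2_value = contingency_coefficient([[10, 0], [0, 10]])
```

A table in which the two tasks are independent, such as `[[10, 10], [10, 10]]`, yields $\chi^2 = 0$ and hence $C = 0$.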
⁴ In this case the ordering of tasks is not necessary, and the
classifiers can run independently from each other.
⁵ $C$ is a test for measuring the association of two nominal
variables, and hence adequate for the type of tasks that we consider
here. The coefficient takes values from 0 (no correlation)
to 1 (complete correlation) and is calculated by the formula:
$C = \sqrt{\chi^2 / (N + \chi^2)}$, where $\chi^2$ is the chi-squared statistic
and $N$ the total number of instances. The significance of $C$ is
then determined from the value of $\chi^2$ for the given data. See
e.g. Goodman & Kruskal (1972).

In the case of tasks $T_i$ and $T_k$ which are significantly
correlated, for each pair of labels from
$L_i \times L_k$ we build a single variable $x(l_{ij}, l_{kp})$. Each
such variable is associated with a coefficient representing
the constraint on the respective pair of labels
$l_{ij}, l_{kp}$ calculated in the following way:

\[ c(l_{ij}, l_{kp}) = -\log_2 p(l_{ij}, l_{kp}) \]

with $p(l_{ij}, l_{kp})$ denoting the prior joint probability of
labels $l_{ij}$ and $l_{kp}$ in the data, which is independent
from the general classification context and hence can
be calculated off-line⁶.
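A minimal sketch of how such separation costs could be derived from annotated data is given below. It estimates the joint probability by relative frequency, which is an assumption of the sketch; the function name and the example label pairs are hypothetical. Pairs that never occur are simply absent from the result, and how to cost them is left open here.

```python
import math
from collections import Counter

def separation_costs(label_pairs):
    """Estimate p(l_ij, l_kp) by relative frequency and return -log2 costs.

    label_pairs holds one (label_of_Ti, label_of_Tk) tuple per training
    instance. Unseen pairs do not appear in the returned dictionary.
    """
    counts = Counter(label_pairs)
    total = sum(counts.values())
    return {pair: -math.log2(n / total) for pair, n in counts.items()}

# Invented co-occurrences of Connective and Verb Form labels.
pairs = [("null", "bare_inf")] * 2 + [("until", "fin_pres"), ("after", "gerund")]
pair_costs = separation_costs(pairs)
```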
The ILP model consists of the target function and
a set of constraints which block illegal assignments
(e.g. only one label of the given task can be selected)⁷.
In our case the target function is the cost
function $Q(f)$, which we want to minimize:

\[ \min Q(f) = \sum_{T_i \in T} \; \sum_{l_{ij} \in L_i} c(l_{ij}) \cdot x(l_{ij}) \; + \sum_{T_i, T_k \in T,\, i<k} \; \sum_{l_{ij}, l_{kp} \in L_i \times L_k} c(l_{ij}, l_{kp}) \cdot x(l_{ij}, l_{kp}) \]
Constraints need to be formulated for both the
simple and compound variables. First we want to
ensure that exactly one label $l_{ij}$ belonging to task $T_i$
is selected, i.e. only one simple variable $x(l_{ij})$ representing
labels of a given task can be set to 1:

\[ \sum_{l_{ij} \in L_i} x(l_{ij}) = 1, \quad \forall i \in \{1, \ldots, n\} \]
We also require that if two simple variables $x(l_{ij})$
and $x(l_{kp})$, modeling respectively labels $l_{ij}$ and $l_{kp}$,
are set to 1, then the compound variable $x(l_{ij}, l_{kp})$,
which models co-occurrence of these labels, is also
set to 1. This is done in two steps: we first ensure
that if $x(l_{ij}) = 1$, then exactly one variable
$x(l_{ij}, l_{kp})$ must also be set to 1:

\[ x(l_{ij}) - \sum_{l_{kp} \in L_k} x(l_{ij}, l_{kp}) = 0, \quad \forall i, k \in \{1, \ldots, n\},\ i < k \,\wedge\, j \in \{1, \ldots, m_i\} \]
and do the same for variable $x(l_{kp})$:

\[ x(l_{kp}) - \sum_{l_{ij} \in L_i} x(l_{ij}, l_{kp}) = 0, \quad \forall i, k \in \{1, \ldots, n\},\ i < k \,\wedge\, p \in \{1, \ldots, m_k\} \]

Finally, we constrain the values of both simple
and compound variables to be binary:

\[ x(l_{ij}) \in \{0, 1\} \,\wedge\, x(l_{ij}, l_{kp}) \in \{0, 1\}, \quad \forall i, k \in \{1, \ldots, n\} \,\wedge\, j \in \{1, \ldots, m_i\} \,\wedge\, p \in \{1, \ldots, m_k\} \]

Figure 2: Graph representation of the ILP model. [Nodes are the labels
$l_{i1}, \ldots, l_{im_i}$ of tasks $T_1, T_2, \ldots, T_n$, annotated with
assignment costs $c(l_{ij})$; edges between labels of correlated tasks are
annotated with separation costs $c(l_{ij}, l_{kp})$.]

⁶ In Section 5 we discuss an alternative approach which considers
the actual input.
⁷ For a detailed overview of linear programming and different
types of LP problems see e.g. Nemhauser & Wolsey (1999).
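For very small instances the effect of the formulation can be reproduced without an ILP solver. The sketch below is not the branch-and-bound procedure used in our system; it exhaustively enumerates complete label assignments, which is exponential in the number of tasks. Selecting exactly one label per task satisfies the first constraint, and the compound variables are then determined by the selected label pairs. The function name, task names, labels and costs are invented for illustration.

```python
from itertools import product

def solve_by_enumeration(assign_cost, pair_cost):
    """Minimize Q(f) over all complete label assignments.

    assign_cost: {task: {label: c(l)}}
    pair_cost:   {(task_i, task_k): {(label_i, label_k): c(l_i, l_k)}},
                 with entries for correlated task pairs only.
    Returns (best_assignment, best_cost).
    """
    tasks = sorted(assign_cost)
    best, best_cost = None, float("inf")
    for labels in product(*(assign_cost[t] for t in tasks)):
        f = dict(zip(tasks, labels))
        # Assignment costs of the selected labels ...
        cost = sum(assign_cost[t][f[t]] for t in tasks)
        # ... plus separation costs of the selected label pairs.
        cost += sum(table[(f[ti], f[tk])]
                    for (ti, tk), table in pair_cost.items())
        if cost < best_cost:
            best, best_cost = f, cost
    return best, best_cost

assign_cost = {
    "T3": {"null": 1.0, "until": 1.2},
    "T5": {"gerund": 0.9, "fin_pres": 1.1},
}
pair_cost = {
    ("T3", "T5"): {
        ("null", "gerund"): 3.0, ("null", "fin_pres"): 2.5,
        ("until", "gerund"): 4.0, ("until", "fin_pres"): 0.5,
    }
}
best, best_cost = solve_by_enumeration(assign_cost, pair_cost)
```

With these invented costs the cheapest labels taken in isolation are `null` and `gerund`, but the cheapest joint assignment is `until` with `fin_pres`, because the separation cost of that pair is low.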
3.2 Graphical Representation
We can represent the decision process that our ILP
model involves as a graph, with the nodes corresponding
to individual labels and the edges marking
the association between labels belonging to correlated
tasks. In Figure 2, task $T_1$ is correlated with
task $T_2$ and task $T_2$ with task $T_n$. No correlation
exists for pair $T_1, T_n$. Both nodes and edges are
augmented with costs. The goal is to select a subset
of connected nodes, minimizing the overall cost,
given that for each group of nodes $T_1, T_2, \ldots, T_n$ exactly
one node must be selected, and the selected
nodes, representing correlated tasks, must be connected.
We can see that in contrast to the pipeline
approach (cf. Figure 1), no local decisions determine
the overall assignment as the global distribution of
costs is considered.
4 Application for NL Generation Tasks
We applied the ILP model described in the previous
section to integrate different tasks in an NLG application
that we describe in detail in Marciniak &
Strube (2004). Our classification-based approach to
language generation assumes that different types of
linguistic decisions involved in the generation process
can be represented in a uniform way as classification
problems. The linguistic knowledge required
to solve the respective classifications is then
learned from a corpus annotated with both semantic
and grammatical information. We have applied
this framework to generating natural language route
directions, e.g.:
(a) Standing in front of the hotel (b) fol-
low Meridian street south for about 100
meters, (c) passing the First Union Bank
entrance on your right, (d) until you see
the river side in front of you.
We analyze the content of such texts in terms of
temporally related situations, i.e. actions (b), states
(a) and events (c,d), denoted by individual discourse
units⁸. The semantics of each discourse unit is further
given by a set of attributes specifying the semantic
frame and aspectual category of the profiled
situation. Our corpus of semantically annotated
route directions comprises 75 texts with a total
number of 904 discourse units (see Marciniak &
Strube (2005)). The grammatical form of the texts
is modeled in terms of LTAG trees also represented
as feature vectors with individual features denoting
syntactic and lexical elements at both the discourse
and clause levels. The generation of each discourse
unit consists in assigning values to the respective
features, of which the LTAG trees are then assembled.
In Marciniak & Strube (2004) we implemented
the generation process sequentially as a cascade of
classifiers that realized incrementally the vector representation
of the generated text's form, given the
meaning vector as input. The classifiers handled the
following eight tasks, all derived from the LTAG-based
representation of the grammatical form:
T1: Discourse Units Rank is concerned with ordering
discourse units at the local level, i.e. only
clauses temporally related to the same parent clause
are considered. This task is further split into a series
of binary precedence classifications that determine
the relative position of two discourse units at a time
(e.g. (a) before (c), (c) before (d), etc.). These partial
results are later combined to determine the ordering.
T2: Discourse Unit Position specifies the position
of the child discourse unit relative to the parent one
(e.g. (a) left of (b), (c) right of (b), etc.).
T3: Discourse Connective determines the lexical
form of the discourse connective (e.g. null in (a), until
in (d)).
T4: S Expansion specifies whether a given discourse
unit would be realized as a clause with the
explicit subject (i.e. np+vp expansion of the root S
node in a clause) (e.g. (d)) or not (e.g. (a), (b)).
T5: Verb Form determines the form of the main
verb in a clause (e.g. gerund in (a), (c), bare infinitive
in (b), finite present in (d)).
T6: Verb Lexicalization provides the lexical form
of the main verb (e.g. stand, follow, pass, etc.).
T7: Phrase Type determines for each verb argument
in a clause its syntactic realization as a noun
phrase, prepositional phrase or a particle.
T8: Phrase Rank determines the ordering of verb
arguments within a clause. As in T1 this task is split
into a number of binary classifications.

Discourse Unit                       T3 (Connective)  T4 (S Exp.)  T5 (Verb Form)
Pass the First Union Bank ...        null             vp           bare inf.
It is necessary that you pass ...    null             np+vp        bare inf.
Passing the First Union Bank ...     null             vp           gerund
After passing ...                    after            vp           gerund
After your passing ...               after            np+vp        gerund
As you pass ...                      as               np+vp        fin. pres.
Until you pass ...                   until            np+vp        fin. pres.
Until passing ...                    until            vp           gerund

Table 1: Different realizations of tasks: Connective, Verb
Form and S Exp. Rare but correct constructions are in italics.

Figure 3: Correlation network for the generation tasks. Correlated
tasks are connected with lines. [Nodes: T1 Disc. Units Rank,
T2 Disc. Units Dir., T3 Connective, T4 S Exp., T5 Verb Form,
T6 Verb Lex, T7 Phrase Type, T8 Phrase Rank.]

⁸ The temporal structure was represented as a tree, with discourse
units as nodes.
To apply the LP model to the generation problem
discussed above, we first determined which pairs of
tasks are correlated. The obtained network (Figure 3)
is consistent with traditional analyses of the
linguistic structure in terms of adjacent but separate
levels: discourse, clause, phrase. Only a few
correlations extend over level boundaries and tasks
within those levels are correlated. As an example
consider three interrelated tasks: Connective, S Exp.
and Verb Form and their different realizations presented
in Table 1. Apparently, a different realization
of any of these tasks can affect the overall meaning
of a discourse unit or its stylistics. It can also be seen
that only certain combinations of different forms are
allowed in the given semantic context. We can conclude
that for such groups of tasks sequential processing
may fail to deliver an optimal assignment.
5 Experiments and Results
In order to evaluate our approach we conducted
experiments with two implementations of the ILP
model and two different pipelines (presented below).
Each system takes as input a tree structure, repre-
senting the temporal structure of the text. Individ-
ual nodes correspond to single discourse units and
their semantic content is given by respective feature
vectors. Generation occurs in a number of stages,
during which individual discourse units are realized.
5.1 Implemented Systems
We used the ILP model described in Section 3 to
build two generation systems. To obtain assignment
costs, both systems get a probability distribution for
each task from basic classifiers trained on the training
data. To calculate the separation costs, modeling
the stochastic constraints on the co-occurrence of labels,
we considered correlated tasks only (cf. Figure
3) and applied two calculation methods, which resulted
in two different system implementations.
In ILP1, for each pair of tasks we computed the
joint distribution of the respective labels considering
all discourse units in the training data before the
actual input was known. The joint distributions
obtained in this way were used for generating all discourse units
from the test data. An example matrix with joint distribution
for selected labels of tasks Connective and
Verb Form is given in Table 2. An advantage of this
approach is that the computation can be done in an
offline mode and has no impact on the run-time.

T5 Verb Form   T3 Connective
               null   and    as     after  until
bare inf       0.40   0.18   0      0      0
gerund         0      0      0      0.04   0.01
fin pres       0.05   0.01   0.06   0.03   0.06
will inf       0.06   0.05   0      0      0

Table 2: Joint distribution matrix for selected labels of tasks
Connective (horizontal) and Verb Form (vertical), computed for
all discourse units in a corpus.

T5 Verb Form   T3 Connective
               null   and    as     after  until
bare inf       0.13   0.02   0      0      0
gerund         0      0      0      0      0
fin pres       0      0      0.05   0.02   0.27
will inf       0.36   0.13   0      0      0

Table 3: Joint distribution matrix for tasks Connective and
Verb Form, considering only discourse units similar to (d): until
you see the river side in front of you, at Phi-threshold ≥ 0.8.
In ILP2, the joint distribution for a pair of tasks
was calculated at run-time, i.e. only after the actual
input had become known. This time we did not consider
all discourse units in the training data, but only
those whose meaning, represented as a feature vector,
was similar to the meaning vector of the input
discourse unit. As a similarity metric we used the
Phi coefficient⁹, and set the similarity threshold at
0.8. As can be seen from Table 3, the probability
distribution computed in this way is better suited to
the specific semantic context. This is especially important
if the available corpus is small and the frequency
of certain pairs of labels might be too low to
have a significant impact on the final assignment.
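The selection of similar discourse units can be sketched as follows. The Phi coefficient is computed here from the 2×2 agreement counts of two binary vectors, and multi-class features are first dummy-coded into binary indicators. The function names, the feature names and the feature values are hypothetical; only the threshold of 0.8 is taken from the text.

```python
import math

def phi(u, v):
    """Phi coefficient between two equal-length binary (0/1) vectors."""
    n11 = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def dummy_code(features, domains):
    """One 0/1 indicator per (feature, value) pair, in a fixed order."""
    return [int(features[name] == value)
            for name in sorted(domains) for value in domains[name]]

def similar_units(query, training, domains, threshold=0.8):
    """Keep training units whose meaning vector has Phi >= threshold."""
    q = dummy_code(query, domains)
    return [unit for unit in training
            if phi(q, dummy_code(unit, domains)) >= threshold]

# Invented meaning features with two possible values each.
domains = {"frame": ["motion", "perception"], "aspect": ["state", "event"]}
query = {"frame": "perception", "aspect": "event"}
training = [
    {"frame": "perception", "aspect": "event"},
    {"frame": "motion", "aspect": "state"},
]
selected = similar_units(query, training, domains)
```

The joint distribution of labels for a pair of tasks would then be estimated from the selected units only.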
⁹ Phi is a measure of the extent of correlation between two
sets of binary variables, see e.g. Edwards (1976). To represent
multi-class features on a binary scale we applied dummy coding
which transforms multi-class nominal variables to a set of
dummy variables with binary values.

As a baseline we implemented two pipeline systems.
In the first one we used the ordering of
tasks most closely resembling the conventional NLG
pipeline (see Figure 4). Individual classifiers had access
to both the semantic features, and those output
by the previous modules. To train the classifiers,
the correct feature values were extracted from the
training data and during testing the generated, and
hence possibly erroneous, values were taken. In the
other pipeline system we wanted to minimize the
error-propagation effect and placed the tasks in the
order of decreasing accuracy. To determine the ordering
of tasks we applied the following procedure:
the classifier with the highest baseline accuracy was
selected as the first one. The remaining classifiers
were trained and tested again, but this time they had
access to the output of the selected classifier as an
additional feature. Again, the classifier
with the highest accuracy was selected and the
procedure was repeated until all classifiers were ordered.
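The ordering procedure of the second pipeline can be sketched as a greedy loop. The callback `evaluate` is a hypothetical stand-in for training and testing a classifier with the outputs of the already ordered tasks as additional features; the toy accuracies below are invented.

```python
def order_by_accuracy(tasks, evaluate):
    """Greedy ordering of pipeline stages by accuracy.

    evaluate(task, context) is a hypothetical callback that trains and
    tests the classifier for `task`, given the outputs of the tasks in
    `context` as extra features, and returns its accuracy.
    """
    ordered, remaining = [], list(tasks)
    while remaining:
        best = max(remaining, key=lambda t: evaluate(t, tuple(ordered)))
        ordered.append(best)
        remaining.remove(best)
    return ordered

# Toy stand-in for training and testing: fixed base accuracies, and task
# "B" profits from seeing the output of task "A". Purely illustrative.
def toy_evaluate(task, context):
    base = {"A": 0.90, "B": 0.70, "C": 0.80}
    bonus = 0.15 if task == "B" and "A" in context else 0.0
    return base[task] + bonus

ordering = order_by_accuracy(["C", "B", "A"], toy_evaluate)
```

In the toy example task B overtakes task C once the output of A is available, which illustrates why the remaining classifiers have to be re-evaluated after every selection.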
5.2 Evaluation
We evaluated our system using leave-one-out cross-validation,
i.e. for all texts in the corpus, each
text was used once for testing, and the remaining
texts provided the training data. To solve individual
classification tasks we used the decision tree
learner C4.5 in the pipeline systems and the Naive
Bayes algorithm¹⁰ in the ILP systems. Both learning
schemes yielded highest results in the respective
configurations¹¹. For each task we applied
a feature selection procedure (cf. Kohavi & John
(1997)) to determine which semantic features should
be taken as the input by the respective basic classifiers¹².
We started with an empty feature set, and
then performed experiments checking classification
accuracy with only one new feature at a time. The
feature that scored highest was then added to the feature
set and the whole procedure was repeated iteratively
until no performance improvement took place,
or no more features were left.
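The feature selection loop described above can be sketched as a greedy forward search. The callback `score` is hypothetical and stands for the accuracy of a classifier trained on the given feature subset; using the score of the empty set as the initial reference value is an assumption of this sketch, and the toy gains are invented.

```python
def forward_selection(features, score):
    """Greedy forward selection of features.

    score(feature_subset) is a hypothetical callback returning the
    classification accuracy obtained with that tuple of features.
    """
    selected, best_score = [], score(())
    remaining = list(features)
    while remaining:
        candidate = max(remaining,
                        key=lambda f: score(tuple(selected) + (f,)))
        candidate_score = score(tuple(selected) + (candidate,))
        if candidate_score <= best_score:
            break  # no performance improvement: stop
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Toy scoring function: each feature adds a fixed, invented gain.
def toy_score(subset):
    gains = {"frame": 0.3, "aspect": 0.2, "noise": -0.1}
    return 0.5 + sum(gains[f] for f in subset)

chosen, final_score = forward_selection(["noise", "aspect", "frame"], toy_score)
```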
¹⁰ Both implemented in the Weka machine learning software
(Witten & Frank, 2000).
¹¹ We have found that in direct comparison C4.5 reaches
higher accuracies than Naive Bayes but the probability distribution
that it outputs is strongly biased towards the winning label.
In this case it is practically impossible for the ILP system
to change the classifier's decision, as the costs of other labels
get extremely high. Hence the more balanced probability distribution
given by Naive Bayes can be more easily corrected in the
optimization process.
¹² I.e. trained using the semantic features only, with no access
to the outputs of other tasks.

To evaluate individual tasks we applied two metrics:
accuracy, calculated as the proportion of correct
classifications to the total number of instances,
and the κ statistic, which corrects for the proportion
of classifications that might occur by chance¹³
(Siegel & Castellan, 1988). For end-to-end evaluation,
we applied the Phi coefficient to measure the
degree of similarity between the vector representations
of the generated form and the reference form
obtained from the test data. The Phi statistic is similar
to κ as it compensates for the fact that a match
between two multi-label features is more difficult to
obtain than in the case of binary features. This measure
tells us how well all the tasks have been solved
together, which in our case amounts to generating
the whole text.

              Pipeline 1              Pipeline 2              ILP 1             ILP 2
Tasks         Pos. Accuracy κ         Pos. Accuracy κ         Accuracy κ        Accuracy κ
Dis.Un. Rank  1    96.81%   90.90%    2    96.81%   90.90%    97.43%   92.66%   97.43%   92.66%
Dis.Un. Pos.  2    98.04%   89.64%    1    98.04%   89.64%    96.10%   77.19%   97.95%   89.05%
Connective    3    78.64%   60.33%    7    79.10%   61.14%    79.15%   61.22%   79.36%   61.31%
S Exp.        4    95.90%   89.45%    3    96.20%   90.17%    99.48%   98.65%   99.49%   98.65%
Verb Form     5    86.76%   77.01%    4    87.83%   78.90%    92.81%   87.60%   93.22%   88.30%
Verb Lex      6    64.58%   60.87%    8    67.40%   64.19%    75.87%   73.69%   76.08%   74.00%
Phr. Type     7    86.93%   75.07%    5    87.08%   75.36%    87.33%   76.75%   88.03%   77.17%
Phr. Rank     8    84.73%   75.24%    6    86.95%   78.65%    90.22%   84.02%   91.27%   85.72%
Phi                0.85                    0.87               0.89              0.90

Table 4: Results reached by the implemented ILP systems and two baselines. For both pipeline systems, Pos. stands for the
position of the tasks in the pipeline.

The results presented in Table 4 show that the ILP
systems achieved highest accuracy and κ for most
tasks and reached the highest overall Phi score. Notice
that for the three correlated tasks that we considered
before, i.e. Connective, S Exp. and Verb Form,
ILP2 scored noticeably higher than the pipeline systems.
It is interesting to see the effect of sequential
processing on the results for another group of correlated
tasks, i.e. Verb Lex, Phrase Type and Phrase
Rank (cf. Figure 3). Verb Lex got higher scores
in Pipeline2, with outputs from both Phrase Type
and Phrase Rank (see the respective pipeline positions),
but the reverse effect did not occur: scores
for both phrase tasks were lower in Pipeline1 when
they had access to the output from Verb Lex, contrary
to what we might expect. Apparently, this was
due to the low accuracy for Verb Lex which caused
the already mentioned error propagation¹⁴. This example
shows well the advantage that optimization
processing brings: both ILP systems reached much
higher scores for all three tasks.

¹³ Hence the κ values obtained for tasks of different difficulties
can be directly compared, which gives a clear notion how
well individual tasks have been solved.
¹⁴ Apparently, tasks which involve lexical choice get low
scores with retrieval measures as the semantic content allows
typically more than one correct form.
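The two per-task metrics used in the evaluation can be sketched as follows. Accuracy is the proportion of matching labels. For κ the sketch estimates chance agreement from the two separate label distributions (Cohen's formulation), which is an assumption made here; the variant described by Siegel & Castellan (1988) may estimate chance agreement differently. The function names and the example labels are invented.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Proportion of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def kappa(gold, predicted):
    """Chance-corrected agreement, (P_o - P_e) / (1 - P_e).

    P_e is estimated from the gold and predicted label distributions
    separately (Cohen's formulation), an assumption of this sketch.
    """
    n = len(gold)
    p_o = accuracy(gold, predicted)
    gold_freq, pred_freq = Counter(gold), Counter(predicted)
    p_e = sum(gold_freq[l] * pred_freq[l] for l in gold_freq) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

gold = ["a", "a", "b", "b"]
predicted = ["a", "a", "b", "a"]
acc, kap = accuracy(gold, predicted), kappa(gold, predicted)
```

In this example three of four labels match, giving an accuracy of 0.75, while half of the agreement is expected by chance, so κ is 0.5.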
5.3 Technical Notes
The size of an LP model is typically expressed in the
number of variables and constraints. In the model
presented here it depends on the number of tasks in
$T$, the number of possible labels for each task, and
the number of correlated tasks. For $n$ different tasks
with the average of $m$ labels, and assuming every
two tasks are correlated with each other, the number
of variables in the LP target function is given by:

\[ \mathrm{num}(\mathrm{var}) = n \cdot m + \tfrac{1}{2} \cdot n(n-1) \cdot m^2 \]

and the number of constraints by:

\[ \mathrm{num}(\mathrm{cons}) = n + n \cdot (n-1) \cdot m \]

To solve the ILP models in our
system we use lp_solve, an efficient GNU-licence
Mixed Integer Programming (MIP) solver¹⁵, which
implements the Branch-and-Bound algorithm. In
our application, the models varied in size from 557
variables and 178 constraints to 709 variables and
240 constraints, depending on the number of arguments
in a sentence. Generation of a text with
23 discourse units took under 7 seconds on a two-processor
2000 MHz AMD machine.
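The two size formulas are easy to evaluate for a given configuration. The helper below is a hypothetical illustration that assumes, as in the formulas, that every pair of tasks is correlated and that every task has exactly $m$ labels.

```python
def model_size(n, m):
    """Variables and constraints for n fully correlated tasks with m labels each."""
    # n*m simple variables plus m*m compound variables per task pair.
    num_var = n * m + n * (n - 1) // 2 * m * m
    # n "exactly one label" constraints plus 2*m linking constraints per pair.
    num_cons = n + n * (n - 1) * m
    return num_var, num_cons

small = model_size(2, 2)
larger = model_size(8, 3)
```

For two tasks with two labels each this gives 8 variables and 6 constraints. The models in our application are smaller than the fully connected case would suggest, since only correlated task pairs contribute compound variables.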
¹⁵ http://www.geocities.com/lpsolve/

6 Conclusions
In this paper we argued that pipeline architectures in
NLP can be successfully replaced by optimization
models which are better suited to handling correlated
tasks. The ILP formulation that we proposed
extends the classification paradigm already established
in NLP and is general enough to accommodate
various kinds of tasks, given the right kind of
data. We applied our model in an NLG application.
The results we obtained show that discrete
optimization eliminates some limitations of sequential
processing, and we believe that it can be successfully
applied in other areas of NLP. We view
our work as an extension to Roth & Yih (2004) in
two important aspects. We experiment with a larger
number of tasks having a varying number of labels.
To lower the complexity of the models, we apply
correlation tests, which rule out pairs of unrelated
tasks. We also use stochastic constraints, which are
application-independent, and for any pair of tasks
can be obtained from the data.
A similar argument against sequential modularization
in NLP applications was raised by van den
Bosch et al. (1998) in the context of word pronunciation
learning. This mapping between words and
their phonemic transcriptions traditionally assumes
a number of intermediate stages such as morphological
segmentation, graphemic parsing, grapheme-phoneme
conversion, syllabification and stress assignment.
The authors report an increase in generalization
accuracy when the modular decomposition
is abandoned (i.e. the tasks of conversion to
phonemes and stress assignment get conflated and
the other intermediate tasks are skipped). It is interesting
to note that a similar dependence on the intermediate
abstraction levels is present in such applications
as parsing and semantic role labelling, which
both assume POS tagging and chunking as their preceding
stages.
Currently we are working on a uniform data format
that would allow different NLP applications to be
represented as multi-task optimization problems. We
are planning to release a task-independent Java API
that would solve such problems. We want to use this
generic model for building NLP modules that traditionally
are implemented sequentially.
Acknowledgements: The work presented here
has been funded by the Klaus Tschira Foundation,
Heidelberg, Germany. The first author receives a
scholarship from KTF (09.001.2004).

References
Althaus, E., N. Karamanis & A. Koller (2004). Computing locally
coherent discourses. In Proceedings of the 42nd Annual
Meeting of the Association for Computational Linguistics,
Barcelona, Spain, July 21-26, 2004, pp. 399–406.
Buchholz, S., J. Veenstra & W. Daelemans (1999). Cascaded
grammatical relation assignment. In Joint SIGDAT Conference
on Empirical Methods in Natural Language Processing
and Very Large Corpora, College Park, Md., June 21-22,
1999, pp. 239–246.
Chekuri, C., S. Khanna, J. Naor & L. Zosin (2001). Approximation
algorithms for the metric labeling problem via a
new linear programming formulation. In Proceedings of the
12th Annual ACM-SIAM Symposium on Discrete Algorithms,
Washington, DC, pp. 109–118.
Cunningham, H., K. Humphreys, Y. Wilks & R. Gaizauskas
(1997). Software infrastructure for natural language processing.
In Proceedings of the Fifth Conference on Applied Natural
Language Processing, Washington, DC, March 31 - April
3, 1997, pp. 237–244.
Daelemans, W. & A. van den Bosch (1998). Rapid development
of NLP modules with memory-based learning. In Proceedings
of ELSNET in Wonderland. Utrecht: ELSNET, pp.
105–113.
Edwards, A. L. (1976). An Introduction to Linear Regression
and Correlation. San Francisco, Cal.: W. H. Freeman.
Goodman, L. A. & W. H. Kruskal (1972). Measures of association
for cross-classification, iv. Journal of the American
Statistical Association, 67:415–421.
Kleinberg, J. M. & E. Tardos (2000). Approximation algorithms
for classification problems with pairwise relationships: Metric
labeling and Markov random fields. Journal of the ACM,
49(5):616–639.
Kohavi, R. & G. H. John (1997). Wrappers for feature subset
selection. Artificial Intelligence Journal, 97:273–324.
Marciniak, T. & M. Strube (2004). Classification-based generation
using TAG. In Proceedings of the 3rd International
Conference on Natural Language Generation, Brockenhurst,
UK, 14-16 July, 2004, pp. 100–109.
Marciniak, T. & M. Strube (2005). Modeling and annotating the
semantics of route directions. In Proceedings of the 6th International
Workshop on Computational Semantics, Tilburg,
The Netherlands, January 12-14, 2005, pp. 151–162.
Nemhauser, G. L. & L. A. Wolsey (1999). Integer and combinatorial
optimization. New York, NY: Wiley.
Punyakanok, V., D. Roth, W. Yih & D. Zimak (2004). Semantic
role labeling via integer linear programming inference. In
Proceedings of the 20th International Conference on Computational
Linguistics, Geneva, Switzerland, August 23-27,
2004, pp. 1346–1352.
Reiter, E. (1994). Has a consensus NL generation architecture
appeared, and is it psycholinguistically plausible? In Proceedings
of the 7th International Workshop on Natural Language
Generation, Kennebunkport, Maine, pp. 160–173.
Reiter, E. & R. Dale (2000). Building Natural Language Generation
Systems. Cambridge, UK: Cambridge University Press.
Roth, D. & W. Yih (2004). A linear programming formulation
for global inference in natural language tasks. In Proceedings
of the 8th Conference on Computational Natural Language
Learning, Boston, Mass., May 2-7, 2004, pp. 1–8.
Siegel, S. & N. J. Castellan (1988). Nonparametric Statistics
for the Behavioral Sciences. New York, NY: McGraw-Hill.
Soon, W. M., H. T. Ng & D. C. L. Lim (2001). A machine
learning approach to coreference resolution of noun phrases.
Computational Linguistics, 27(4):521–544.
van den Bosch, A., T. Weijters & W. Daelemans (1998). Modularity
in inductively-learned word pronunciation systems. In
D. Powers (Ed.), Proceedings of NeMLaP3/CoNLL98, pp.
185–194.
Witten, I. H. & E. Frank (2000). Data Mining - Practical Machine
Learning Tools and Techniques with Java Implementations.
San Francisco, Cal.: Morgan Kaufmann.
