Corpus-Based Methods in Natural Language Generation: Friend or Foe?
Extended Abstract
Owen Rambow
AT&T Labs – Research
Florham Park, NJ, USA
rambow@research.att.com
In computational linguistics, the 1990s were
characterized by the rapid rise to prominence of
corpus-based methods in natural language under-
standing (NLU). These methods include statis-
tical and machine-learning and approaches. In
natural language generation (NLG), in the mean
time, there was little work using statistical and
machine learning approaches. Some researchers
felt that the kind of ambiguities that appeared to
profit from corpus-based approaches in NLU did
not exist in NLG: if the input is adequately speci-
fied, then all the rules that map to a correct out-
put can also be explicitly specified. However,
this paper will argue that this view is not cor-
rect, and NLG can and does profit from corpus-
based methods. The resistance to corpus-based
approaches in NLG may have more to do with the
fact that in many NLG applications (such as re-
port or description generation) the output to be
generated is extremely limited. As is the case
with NLU, if the language is limited, hand-crafted
methods are adequate and successful. Thus, it is
not a surprise that the first use of corpus-based
techniques, at ISI (Knight and Hatzivassiloglou,
1995; Langkilde and Knight, 1998) was moti-
vated by the use of NLG not in “traditional” NLG
applications, but in machine translation, in which
the range of output language is (potentially) much
larger.
In fact, the situations in NLU and NLG do
not actually differ with respect to the notion of
ambiguity. Though it is not a trivial task, we
can fully specify a grammar such that the gen-
erated text is not ungrammatical. But the prob-
lem for NLG is not specifying a grammar, but de-
termining which part of the grammar to use: to
give a simple example, a give-event can be gen-
erated with the double-object frame (give Mary a
book) or with a prepositional object (give a book
to Mary). We can easily specify the syntax of
these two constrictions. What we need to know
is when to choose which. But the situation is ex-
actly the same in NLU: the problem is knowing
which grammar rules to use when during analy-
sis. Thus, just as the mapping from input to output
is ambiguous in NLU, it is ambiguous in NLG,
not because the grammar is wrong, but because it
leaves too many options. The difference is that in
NLG, different outputs differ not in whether they
are correct (as is the case in NLU), but in whether
they are appropriate or felicitous in a given con-
text. Thus, the need for corpus-based approaches
is less apparent.
Determining which linguistic forms are appro-
priate in what contexts is a hard task. The intro-
spective grammaticality judgment that (perhaps)
is legitimate in the study of syntax is method-
ologically suspect in the study of language use in
context, and most work in linguistic pragmatics
is in fact corpus-based, such as Prince’s work us-
ing the Watergate transcripts and similar corpora
(Prince, 1981). Thus, it is clear that the role of
corpus-based methods in NLG is not to displace
traditional methods, but rather to accelerate them.
If indeed corpus-based methods are necessary in
any case, we may as well use automated proce-
dures for discovering regularities; we no longer
need to use multi-colored pencils to mark up pa-
per copies. For the researcher, there is enough left
to do: the corpus-based techniques still require
linguistic research in order to determine which
features to code for (i.e., what linguistic phenom-
ena to count). To the extent that corpus-based
methods fail currently, it is largely because we are
substituting easily codable features for those that
are more difficult to code, or because we are sim-
ply coding the wrong features. It is not because
there is some hidden truth which traditional lin-
guistic methodologies have access to but corpus-
based methods do not, because they are not in fact
in opposition to each other.
Finally, the emphasis on evaluation that the
corpus-based techniques in NLU have brought
with them have often aroused animosity in the
NLG community. Evaluation is necessary for
development purposes when using corpus-based
techniques: it is easy to generate many differ-
ent hypotheses, and we need to be able to choose
among them. Since this is crucial, increased at-
tention needs to be paid to evaluation in gen-
eration (Bangalore et al., 2000; Rambow et al.,
2001). But again, the situation is in fact not dif-
ferent from a traditional linguistic methodology:
theories about language use in context need to be
defeasible on empirical grounds and hence need
to be evaluated against a corpus. Of course, the
choice of evaluation corpus is an important one,
and the costs associated with compiling and an-
notating corpora can greatly impact the choice of
evaluation corpus and hence the evaluation.
In conclusion, NLG has nothing to fear from
corpus-based methods. Instead, the NLG com-
munity can continue to provide a test-bed for lin-
guists to exercise their theories (to a much greater
extent than can NLU). The difference is that ev-
eryone can now start using computers.

References
Srinivas Bangalore, Owen Rambow, and Steve Whit-
taker. 2000. Evaluation metrics for generation. In
Proceedings of the First International Natural Lan-
guage Generation Conference (INLG2000), Mitzpe
Ramon, Israel.
K. Knight and V. Hatzivassiloglou. 1995. Two-level,
many-paths generation. In 33rd Meeting of the As-
sociation for Computational Linguistics (ACL’95).
Irene Langkilde and Kevin Knight. 1998. The prac-
tical value of n-grams in generation. In Proceed-
ings of the Ninth International Natural Language
Generation Workshop (INLG’98), Niagara-on-the-
Lake, Ontario.
Ellen F. Prince. 1981. Topicalization, focus move-
ment and yiddish movement: A pragmatic dif fer-
entiation. In D. Alford, editor, Proceedings of the
Seventh Annual Meeting of the Berkely Linguistics
Society, pages 249–264. BLS.
Owen Rambow, Monica Rogati, and Marilyn Walker.
2001. Evaluating a trainable sentence planner for a
spoken dialogue system. In 39th Meeting of the As-
sociation for Computational Linguistics (ACL’01),
Toulouse, France.
