Evaluating the Portability of Revision Rules 
for Incremental Summary Generation 
Jacques Robin 
http://www.di.ufpe.br/~jr 
jr@di.ufpe.br 
Departamento de Inform£tica, Universidade Federal de Pernambuco 
Caixa Postal 7851, Cidade Universit£ria 
Recife, PE 50732-970 Brazil 
Abstract 
This paper presents a quantitative evalu- 
ation of the portability to the stock mar- 
ket domain of the revision rule hierarchy 
used by the system STREAK to incremen- 
tally generate newswire sports summaries. 
The evaluation consists of searching a test 
corpus of stock market reports for sentence 
pairs whose (semantic and syntactic) struc- 
tures respectively match the triggering con- 
dition and application result of each revi- 
sion rule. The results show that at least 
59% of all rule classes are fully portable, 
with at least another 7% partially portable. 
1 Introduction 
The project STREAK 1 focuses on the specific issues 
involved in generating short, newswire style, natural 
language texts that summarize vast amount of in- 
put tabular data in their historical context. A series 
of previous publications presented complementary 
aspects of this project: motivating corpus analysis 
in (Robin and McKeown, 1993), new revision-based 
text generation model in (Robin, 1993), system im- 
plementation and rule base in (Robin, 1994a) and 
empirical evaluation of the robustness and scalabil- 
ity of this new model as compared to the traditional 
single pass pipeline model in (Robin and McKeown, 
1995). The present paper completes this series by 
describing a second, empirical, corpus-based evalu- 
ation, this time quantifying the portability to an- 
other domain (the stock market) of the revision rule 
hierarchy acquired in the sports domain and imple- 
mented in STREAK. The goal of this paper is twofold: 
(1) assessing the generality of this particular rule hi- 
erarchy and (2) providing a general, semi-automatic 
1 Surface Text Reviser Expressing Additional 
Knowledge. 
methodology for evaluating the portability of seman- 
tic and syntactic knowledge structures used for nat- 
ural language generation. The results reveal that at 
least 59% of the revision rule hierarchy abstracted 
from the sports domain could also be used to incre- 
mentally generate the complex sentences observed in 
a corpus of stock market reports. 
I start by providing the context of the evalua- 
tion with a brief overview of STREAK's revision-based 
generation model, followed by some details about the 
empirical acquisition of its revision rules from cor- 
pus data. I then present the methodology of this 
evaluation, followed by a discussion of its quantita- 
tive results. Finally, I compare this evaluation with 
other empirical evaluations in text generation and 
conclude by discussing future directions. 
2 An overview of STREAK 
The project STREAK was initially motivated by an- 
alyzing a corpus of newswire summaries written by 
professional sportswriters 2. This analysis revealed 
four characteristics of summaries that challenge the 
capabilities of previous text generators: concise lin- 
guistic forms, complex sentences, optional and back- 
ground facts opportunistically slipped as modifiers 
of obligatory facts and high paraphrasing power. By 
greatly increasing the number of content planning 
and linguistic realization options that the genera- 
tor must consider, as well as the mutual constraints 
among them, these characteristics make generating 
summaries in a single pass impractical. 
The example run given in Fig. 1 illustrates how 
STREAK overcomes these difficulties. It first gener- 
ates a simple draft sentence that contains only the 
obligatory facts to include in any game report (lo- 
cation, date, game result and key player statistic). 
It then applies a series of revision rules 3, each one 
2This 800,000 word corpus covers a whole NBA (Na- 
tional Basketball Association) season. 
3In Fig. 1, the nile used is indicated above each re- 
205 
1. Initial draft (basic sentence pattern): 
"Dallas, TX - Charles Barkley scored 42 points Sunday as the Phoenix Suns defeated the Dallas Mavericks 
123-97." 
2. Adjunctization of Created into Instrument: 
"Dallas, TX - Charles Barkley tied a season high wlth 42 points Sunday as the Phoenix Suns defeated the 
Dallas Mavericks 123-97." 
3. Coordinative Conjoin of Clause: 
"Dallas, TX - Charles Barkley tied a season high with 42 points and Danny A|nge added 21 Sunday as the 
Phoenix Suns defeated the Dallas Mavericks 123-97." 
4. Absorb of Clause in Clause as Result with Agent Control: 
"Dallas, TX - Charles Barkley tied a season high with 42 points and Danny Ainge came oIT the bench 
to add 21 Sunday as the Phoenix Suns defeated the Dallas Mavericks 123-97." 
5. l~ominalization with 0rdinal Adjoin: 
"Dallas, TX - Charles Barldey tied a season high with 42 points and Danny Ainge came off the bench to add 
21 Sunday as the Phoenix Suns handed the Dallas Mavericks their 13th straight home defeat 123-97." 
6. Adjoin of Classifier to NP: 
"Dallas, TX - Charles Barkley tied a season high with 42 points and Danny Ainge came off the bench to add 
21 Sunday as the Phoenix Suns handed the Dallas Mavericks their league worst 13th straight home defeat 
123-97." 
Figure 1: Complex sentence generation through incremental revisions in STREAK 
opportunistically adding a new fact 4 that either: 
• Complements an Mready included fact (e.g., re- 
vision of sentence 2 into 3). 
• Justifies its relevance by providing its historical 
background (e.g., revision of sentence 1 into 2). 
Some of these revisions are non-monotonic, 
rewording 5 a draft fact to more concisely accommo- 
date the additional fact (e.g., revision of sentence 
1 into 2). Popping additional facts from a prior- 
ity stack, STREAK stops revising when the summary 
vised sentence. 
4Highlighted in bold in Fig. 1. 
5In Fig. 1, words that get deleted are italicized and 
words that get modified are underlined. 
Charles Barldey scored 42 points. Those 42 points equal 
his best scoring performance of the season. Danny Ainge 
is a teammate of Barkley. They play for the Phoenix 
Suns. Ainge is a reserve player. Yet he scored 21 points. 
The high scoring performances by Barkley and Ainge 
helped the Suns defeat the Dallas Mavericks. The Mav- 
ericks played on their homecourt in Texas. They had 
already lost their 12 previous games there. No other 
team in the league has lost so many games in a row at 
home. The final score was 123-97. The game was played 
Sunday. 
Figure 2: Paragraph of simple sentences 
paraphrasing a single complex sentence 
sentence reaches linguistic complexity limits empir- 
icMly observed in the corpus (e.g., 50 word long or 
parse tree of depth 10). 
While STREAK generates only single sentences, 
those complex sentences convey as much information 
as whole paragraphs made of simple sentences, only 
far more fluently and concisely. This is illustrated 
by the 12 sentence paragraph 6 of Fig. 2, which para- 
phrases sentence 6 of Fig. 1. Because they express 
facts essentially independently of one another, such 
multi-sentence paragraphs are much easier to gener- 
ate than the complex single sentences generated by 
STREAK. 
3 Acquiring revision rules from 
corpus data 
The rules driving the revision process in STREAK 
were acquired by reverse engineering 7 about 300 cor- 
pus sentences. These sentences were initially classi- 
fied in terms of: 
• The combination of domain concepts they ex- 
pressed. 
• The thematic role and top-level syntactic cate- 
gory used for each of these concepts. 
6This paragraph was not generated by STREAK, it is 
shown here only for contrastive purposes. 
v i.e., analyzing how they could be incrementally gen- 
erated through gradual revisions. 
206 
The resulting classes, called realization patterns, 
abstract the mapping from semantic to syntactic 
structure by factoring out lexical material and syn- 
tactic details. Two examples of realization patterns 
are given in Fig. 3. Realization patterns were then 
grouped into surface decrement pairs consisting of: 
• A more complex pattern (called the target pat- 
tern). 
• A simpler pattern (called the source pattern) 
that is structurally the closest to the target pat- 
tern among patterns with one less concept s . 
The structural transformations from source to tar- 
get pattern in each surface decrement pair were 
then hierarchically classified, resulting in the revi- 
sion rule hierarchy shown in Fig. 4-10. For ex- 
ample, the surface decrement pair < R~, R 1 >, 
shown in Fig. 3, is one of the pairs from which 
the revision rule Adjunctization of Range into 
Instrument, shown in Fig. 10 was abstracted. 
It involves displacing the Range argument of the 
source clause as an Instrument adjunct to accom- 
modate a new verb and its argument. This revi- 
sion rule is a sibling of the rule Adjunctization of 
Created into Instrument used to revise sentence 
i into 2 in STREAK'S run shown in Fig. 1 (where the 
Created argument role "42 points" of the verb "to 
score" in I becomes an Instrument adjunct in 2). 
The bottom level of the revision rule hierarchy 
specifies the side revisions that are orthogonal and 
sometimes accompany the restructuring revisions 
discussed up to this point. Side revisions do not 
make the draft more informative, but instead im- 
prove its style, conciseness and unambiguity. For ex- 
ample, when STREAK revises sentence (3) into (4) in 
the example run of Fig. 1, the Agent of the absorbed 
clause "Danny Ainge added 21 points" becomes con- 
trolled by the new embedding clause "Danny Ainge 
came off the bench" to avoid the verbose form: 
? "Danny Ainge came off the bench for Danny Ainge 
to add 21 points". 
4 Evaluation methodology 
In the spectrum of possible evaluations, the eval- 
uation presented in this paper is characterized as 
follows: 
• Its object is the revision rule hierarchy acquired 
from the sports summary corpus. It thus does 
not directly evaluate the output of STREAK, but 
rather the special knowledge structures required 
by its underlying revision-based model. 
s i.e., the source pattern expresses the same concept 
combination than the target pattern minus one concept. 
The particular property of this revision rule hi- 
erarchy that is evaluated is cross-domain porta- 
bility: how much of it could be re-used to gener- 
ate summaries in another domain, namely the 
stock market? 
The basis for this evaluation is corpus data 9. 
The original sports summary corpus from which 
the revision rules were acquired is used as the 
'training' (or acquisition) corpus and a cor- 
pus of stock market reports taken from several 
newswires is used as the 'test' corpus. This test 
corpus comprises over 18,000 sentences. 
The evaluation procedure is quantitative, mea- 
suring percentages of revision rules whose target 
and source realization patterns are observable 
in the test corpus. It is also semi-automated 
through the use of the corpus search tool CREP 
(Duford, 1993) (as explained below). 
Basic principle As explained in section 3, a re- 
vision rule is associated with a list of surface decre- 
ment pairs, each one consisting of: 
A source pattern whose content and linguistic 
form match the triggering conditions of the rule 
(e.g., R~ in Fig. 3 for the rule Adjunctization 
of Range into Instrument). 
A target pattern whose content and linguis- 
tic form can be derived from the source pat- 
tern by applying the rule (e.g., R 2 in Fig. 3 
for the rule Adjunctization of Range into 
Instrument). 
This list of decrement pairs can thus be used as 
the signature of the revision rule to detect its usage 
in the test corpus. The needed evidence is the simul- 
taneous presence of two test corpus sentences 1° , each 
one respectively matching the source and target pat- 
terns of at least one element in this list. Requiring 
occurrence of the source pattern in the test corpus is 
necessary for the computation of conservative porta- 
bility estimates: while it may seem that one target 
pattern alone is enough evidence, without the pres- 
ence of the corresponding source pattern, one cannot 
rule out the possibility that, in the test domain, this 
target pattern is either a basic pattern or derived 
from another source pattern using another revision 
rule. 
9Only the corpus analysis was performed for both do- 
mains. The implementation was not actually ported to 
the stock market domain. 
1°In general, not from the same report. 
207 
Realization pattern R~: 
• Expresses the concept pair: 
<game-result(winner,loser,score), streak(winner,aspect,result-type,lengt h) >. 
• Is a target pattern of the revision rule Adjunctization of Range into Instrument. 
winner aspect l type l streak length \[ 
agent action affected/located location 
proper verb NP PP 
det\] classifier I noun prep \] 
Utah extended its win streak to 6 games with 
Boston stretching its winning spree to 9 outings with 
\[ score \]game-result \[ loser 
instrument 
PP 
NP 
L_J numbe~J PP 
a 99-84 triumph - over enl3-d-~V-~ 
a 118-94 rout of Utah 
Realization pattern R~: 
* Expresses the single concept <game-result(winner,loser,score)>. 
• Is a source pattern of the revision rule Adjunctization of Range into Instrument. 
• Is a surface decrement of pattern R~ above. 
winner 
agent action 
proper support-verb 
I score \] game-result I loser 
range 
NP 
det I number I noun 
Chicago claimed a Y 
Orlando recorded a 101-95 triumph 
I PP 
over New Jersey 
against New York 
Figure 3: Realization pattern examples 
Partially automating the evaluation The soft- 
ware tool CREP 11 was developed to partially auto- 
mate detection of realization patterns in a text cor- 
pus. The basic idea behind CREP is to approximate 
a realization pattern by a regular expression whose 
terminals are words or parts-of-speech tags (POS- 
tags). CR~.P will then automatically retrieve the cor- 
pus sentences matching those expressions. For ex- 
ample, the CREP expression C~1 below approximates 
the realization pattern R~ shown in Fig. 3: 
(61) TEAM Of (claimed\[recorded)@VBD I- SCORE O= 
(victory\[triumph)@NN O= (over\[against)@IN O= TEAM 
In the expression above, 'VBD', 'NN' and 'IN' are the 
POS-tags for past verb, singular noun and preposi- 
tion (respectively), and the sub-expressions 'TEAH' 
and 'SCORE' (whose recursive definitions are not 
shown here) match the team names and possible fi- 
nal scores (respectively) in the NBA. The CREP op- 
erators 'N=' and 'N-' (N being an arbitrary integer) 
respectively specify exact and minimal distance of 
N words, and 'l' encodes disjunction. 
llcREP was implemented (on top of FLEX, GNUS' ver- 
sion of LEX) and to a large extent also designed by Du- 
ford. It uses Ken Church's POS tagger. 
Because a realization pattern abstracts away from 
lexical items to capture the mapping from concepts 
to syntactic structure, approximating such a pattern 
by a regular expression of words and POS-tags in- 
volves encoding each concept of the pattern by the 
disjunction of its alternative lexicalizations. In a 
given domain, there are therefore two sources of in- 
accuracy for such an approximation: 
• Lexical ambiguity resulting in false positives by 
over-generalization. 
• Incomplete vocabulary resulting in false nega- 
tives by over-specialization 12. 
Lexical ambiguities can be alleviated by writing 
more context-sensitive expressions. The vocabu- 
lary can be acquired through additional exploratory 
CREP runs with expressions containing wild-cards 
for some concept slots. Although automated corpus 
search using CREP expressions considerably speeds- 
up corpus analysis, manual intervention remains 
12This is the case for example of C1 above, which is a 
simplification of the actual expression that was used to 
search occurrences of R~ in the test corpus (e.g., Cz is 
missing "win" and "rout" as alternatives for "victory"). 
208 
necessary to filter out incorrect matches resulting 
from imperfect approximations. 
Cross-domain discrepancies Basic similarities 
between the finance and sports domains form the 
basis for the portability of the revision rules. In 
both domains, the core facts reported are statis- 
tics compiled within a standard temporal unit (in 
sports, one ballgame; in finance, one stock market 
session) together with streaks 13 and records com- 
piled across several such units. This correspondence 
is, however, imperfect. Consequently, before they 
can track down usage of a revision rule in the test do- 
main, the CREP expressions approximating the sig- 
nature of the rule in the acquisition domain must be 
adjusted for cross-domain discrepancies to prevent 
false negatives. Two major types of adjustments are 
necessary: lexical and thematic. 
Lexical adjustments handle cases of partial mis- 
match between the respective vocabularies used to 
lexicalize matching conceptual structures in each do- 
main. (e.g.,, the verb "to rebound from" expresses 
the interruption of a streak in the stock market do- 
main, while in the basketball domain "to break" or 
"to snap" are preferred since "to rebound" is used to 
express a different concept). 
Thematic adjustments handle cases of partial dif- 
ferences between corresponding conceptual struc- 
tures in the acquisition and test domains. For ex- 
ample, while in sports garae-result involves an- 
tagonistic teams, its financial domain counterpart 
session-result concerns only a single indicator. 
Consequently, the sub-expression for the loser role 
in the example CI:tEP expression (~1 shown before, 
and which approximates realization pattern /~ for 
game-resull; (shown in Fig. 3), needs to become 
optional in order to also approximate patterns for 
session-resul~. This is done using the CREP op- 
erator ? as shown below: 
(C1/): TEAM O= (claimedlrecorded)@VBD 1- 
SCORE O= (victoryltriumph) @NN O= 
( (over\] against)@IN 0= TEAM) ? 
Note that it is the CREP expressions used to auto- 
matically retrieve test corpus sentence pairs attest- 
ing usage of a revision rule that require this type 
of adjustment and not the revision rule itself 14. For 
example, the Adjoin of Frequency PP to Clause 
revision rule attaches a streak to a session-result 
clause without loser role in exactly the same way 
than it attaches a streak to a game-result with 
13i.e., series of events with similar outcome. 
14Some revision rules do require adjustment, but of 
another type (cfl Sect. 5). 
loser role. This is illustrated by the two corpus 
sentences below: 
P~: "The Chicago Bulls beat the Phoenix Suns 99 
91 for their 3rd straight win" 
pt: "The Amex Market Value Index inched up 0.16 
to 481.94 for its sixth straight advance" 
Detailed evaluation procedure The overall 
procedure to test portability of a revision rule con- 
sists of considering the surface decrement pairs in the 
rule signature in order, and repeating the following 
steps: 
1. Write a CREP expression for the acquisition tar- 
get pattern. 
2. Iteratively delete, replace or generalize sub- 
expressions in the CREP expression - to gloss 
over thematic and lexical discrepancies between 
the acquisition and test domains, and prevent 
false negatives - until it matches some test cor- 
pus sentence(s). 
3. Post-edit the file containing these matched sen- 
tences. If it contains only false positives of the 
sought target pattern, go back to step 2. Oth- 
erwise, proceed to step 4. 
4. Repeat step (1-3) with the source pattern of the 
pair under consideration. If a valid match can 
also be found for this source pattern, stop: the 
revision rule is portable. Otherwise, start over 
from step 1 with the next surface decrement pair 
in the revision rule signature. If there is no next 
pair left, stop: the revision rule is considered 
non-portable. 
Steps (2,3) constitute a general, generate-and-test 
procedure to detect realization patterns usage in a 
corpus 15. Changing one CKEP sub-expression may 
result in going from too specific an expression with 
no valid match to either: (1) a well-adjusted ex- 
pression with a valid match, (2) still too specific an 
expression with no valid match, or (3) already too 
general an expression with too many matches to be 
manually post-edited. 
It is in fact always possible to write more context- 
sensitive expressions, to manually edit larger no- 
match files, or even to consider larger test corpora in 
the hope of finding a match. At some point however, 
one has to estimate, guided by the results of previ- 
ous runs, that the likelihood of finding a match is too 
15And since most generators rely on knowledge struc- 
tures equivalent to realization patterns, this procedure 
can probably be adapted to semi-automatically evaluate 
the portability of virtually any corpus-based generator. 
209 
small to justify the cost of further attempts. This is 
why the last line in the algorithm reads "considered 
non-portable" as opposed to "non-portable". The 
algorithm guarantees the validity of positive (i.e., 
portable) results only. Therefore, the figures pre- 
sented in the next section constitute in fact a lower- 
bound estimate of the actual revision rule portability. 
5 Evaluation results 
The results of the evaluation are summarized in 
Fig. 4-10. They show the revision rule hierarchy, 
with portable classes highlighted in bold. The fre- 
quency of occurrence of each rule in the acquisition 
corpus is given below the leaves of the hierarchy. 
Some rules are same-concept portable: they are 
used to attach corresponding concepts in each do- 
main (e.g., Adjoin of Frequency PP to Clause, 
as explained in Sect. 4). They could be re-used "as 
is" in the financial domain. Other rules, however, 
are only different-concept portable: they are used to 
attach altogether different concepts in each domain. 
This is the case for example of Adjoin Finite Time 
Clause to Clause, as illustrated by the two corpus 
sentences below, where the added temporal adjunct 
(in bold) conveys a streak in the sports sentence, but 
a complementary statistics in the financial one: 
T~: "to lead Utah to a 119-89 trouncing of Denver 
as the Jazz defeated the Nuggets for the 12th 
straight time at home." 
T~: "Volume amounted to a solid 349 million shares 
as advances out-paced declines 299 to 218.". 
For different-concept portable rules, the left-hand 
side field specifying the concepts incorporable to the 
draft using this rule will need to be changed when 
porting the rule to the stock market domain. In 
Fig. 4-10, the arcs leading same-concept portable 
classes are full and thick, those leading to different- 
concept portable classes are dotted, and those lead- 
ing to a non-portable classes are full but thin. 
59% of all revision rule classes turned out to be 
same-concept portable, with another 7% different- 
concept portable. Remarkably, all eight top-level 
classes identified in the sports domain had instances 
same-concept portable to the financial domain, even 
those involving the most complex non-monotonic re- 
visions, or those with only a few instances in the 
sports corpus. Among the bottom-level classes that 
distinguish between revision rule applications in very 
specific semantic and syntactic contexts, 42% are 
same-concept portable with another 10% different- 
concept portable. Finally, the correlation between 
high usage frequency in the acquisition corpus and 
portability to the test corpus is not statistically sig- 
nificant (i.e., the hypothesis that the more common 
a rule, the more likely it is to be portable could not 
be confirmed on the analyzed sample). See (Robin, 
1994b) for further details on the evMuation results. 
There are two main stumbling blocks to porta- 
bility: thematic role mismatch and side revisions. 
Thematic role mismatches are cases where the se- 
mantic label or syntactic sub-category of a con- 
stituent added or displaced by the rule differ in 
each domain (e.g., Adjunctization of Created 
into Instrument vs. Adjoin of Affected into 
Instrument). They push portability from 92% down 
to 71%. Their effect could be reduced by allowing 
STREAK'S reviser to manipulate the draft down to 
the surface syntactic role level (e.g., in both cor- 
pora Created and Affected surface as object). Cur- 
rently, the reviser stops at the thematic role level to 
allow STREAK to take full advantage of the syntac- 
tic processing front-end SURGE (Elhadad and Robin, 
1996), which accepts such thematic structures as in- 
put. 
Accompanying side revisions push portability 
from 71% to 52%. This suggests that the design of 
STREAK could be improved by keeping side revisions 
separate from re-structuring revisions and interleav- 
ing the applications of the two. Currently, they are 
integrated together at the bottom of the revision rule 
hierarchy. 
6 Related work 
Apart from STREAK, only three generation projects 
feature an empirical and quantitative evaluation: 
ANA (Kukich, 1983), KNIGHT (Lester, 1993) and IM- 
AGENE (Van der Linden, 1993). 
ANA generates short, newswire style summaries of 
the daily fluctuations of several stock market indexes 
from half-hourly updates of their values. For eval- 
uation, Kukich measures both the conceptual and 
linguistic (lexical and syntactic) coverages of ANA 
by comparing the number of concepts and realiza- 
tion patterns identified during a corpus analysis with 
those actually implemented in the system. 
KNIGHT generates natural language concept defi- 
nitions from a large biological knowledge base, rely- 
ing on SURGE for syntactic realization. For evalua- 
tion, Lester performs a Turing test in which a panel 
of human judges rates 120 sample definitions by as- 
signing grades (from A to F) for: 
• Semantic accuracy (defined as "Is the definition 
adequate, providing correct information and fo- 
cusing on what's important?" in the instruc- 
tions provided to the judges). 
• Stylistic accuracy (defined as "Does the defini- 
tion use good prose and is the information it 
210 
Monotonic Revisions Non-monotonic Revisions 
Adl~join Recast Adjunctization Nomlnalization Demotion Promotion 
Figure 4: Revision rule hierarchy: top-levels 
Absorb 
of NP ofclause 
insidc-NP ~¢~'~ ....... ~d~lause inslde!lanse 
as querier it ...... "°"'~ "''"~ - as- ns rument as-affected-apposition as-mean as-co-event 
1 2 l 1 3 
Figure 5: Absorb revision rule sub-hierarchy 
Recast 
of NP of clause 
from classifier from location from range 
to qualifier to instrument to time to instrument 
10 9 1 1 
Nominalization 
+ordinal +ordinal +ordinal 
+classifier +qualifier 
1 2 2 
Figure 6: Recast and Nominalize revision rule sub-hierarchy 
Demotion 
from~ffected 
\] to qualifleJ(affeeted) I to score(co-event) to determiner(affected) 
1 2 1 
Coordination Promotion 
simple w/Adjunctization 
1 1 
Figure 7: Demotion and Promotion revision rule sub-hierarchy 
211 
Adjoin 
to NT to clause 
1 25 aSfi use4~non.fi • . • ° rel ve-cla rote clause 
abridged /~deleted ~ +reorder ref ref full abridged 2ref/''- ref~ 3 ref 9 2 2 10 
frequen~nt 
I I i I PP non-finite finite non-finite 
da e clause clause 2 , 
full I ~ abridged abridg~ ~ deleted refl~ ref ref~ ref J \ ref 
13 3 1 12 4 
Figure 8: Adjoin revision rule sub-hierarchy 
Conjoin 
NPs 
by apposition / 
%%%%% 
abridged ~deleted re~ f ref I \ ref 
15 5 4 1 
clauses 
I by coordination by coordination 
"+scope full ~bridged full Aabridged l~mark refl ~ ref refl ~ ref 
1 2 1 5 1 1 
Figure 9: Conjoin revision rule sub-hierarchy 
into opposition into instrument 
of affected of range of created of location 
1 ~ +agent 1 \demotion 
abridged / full,deleted \ full abridged ref// refJ\ ref \ ref/~ ref 
7 27 3 1 14 5 4 
Figure 10: Adjunctization revision rule sub-hierarchy 
212 
ANA 
KNIGHT 
Object of Evaluation 
knowledge structures 
output text 
IMAGENE output text 
knowledge structures 
STREAK 
Evaluated Properties 
conceptual coverage 
linguistic coverage 
semantic accuracy 
stylistic accuracy 
stylistic accuracy 
stylistic robustness 
cross-domain portability 
same-domain robustness 
same-domain scalability 
Empirical Basis 
textual corpus 
human judges 
textual corpus 
textual corpus 
Evaluation Procedure 
manual 
mallual 
manual 
semi-automatic 
Figure 11: Empirical evaluations in language generation 
conveys well organized" in the instructions pro- 
vided to the judges). 
The judges did not know that half the definitions 
were computer-generated while the other half were 
written by four human domain experts. Impres- 
sively, the results show that: 
• With respect to semantic accuracy, the human 
judges could not tell KNIGHT apart from the hu- 
man writers. 
* While as a group, humans got statistically sig- 
nificantly better grades for stylistic accuracy 
than KNIGHT, the best human writer was single- 
handly responsible for this difference. 
IMAGENE generates instructions on how to oper- 
ate household devices relying on NIGEL (Mann and 
Matthiessen, 1983) for syntactic realization. The 
implementation focuses on a very limited aspect of 
text generation: the realization of purpose relations. 
Taking as input the description of a pair <operation, 
purpose of the operation>, augmented by a set of 
features simulating the communicative context of 
generation, IMAGENE selects, among the many real- 
izations of purpose generable by NIGEL (e.g., fronted 
to-infinitive clause vs. trailing for-gerund clauses), 
the one that is most appropriate for the simulated 
context (e.g., in the context of several operations 
sharing the same purpose, the latter is preferentially 
expressed before those actions than after them). IM- 
AGENE's contextual preference rules were abstracted 
by analyzing an acquisition corpus of about 300 pur- 
pose clauses from cordless telephone manuMs. For 
evaluation, Van der Linden compares the purpose 
realizations picked by IMAGENE to the one in the 
corresponding corpus text, first on the acquisition 
corpus and then on a test corpus of about 300 other 
purpose clauses from manuals for other devices than 
cordless telephones (ranging from clock radio to au- 
tomobile). The results show a 71% match on the 
acquisition corpus 16 and a 52% match on the test 
corpus. 
The table of Fig. 11 summarizes the difference 
on both goal and methodology between the eval- 
uations carried out in the projects ANA, KNIGHT, 
IMAGENE and STREAK. In terms of goals, while 
Kukich and Lester evaluate the coverage or accu- 
racy of a particular implementation, I instead fo- 
cus on three properties inherent to the use of the 
revision-based generation model underlying STREAK: 
robustness (how much of other text samples from the 
same domain can be generated without acquiring 
new knowledge?) and scalability (how much more 
new knowledge is needed to fully cover these other 
samples?) discussed in (Robin and McKeown, 1995), 
and portability to another domain in the present pa- 
per. Van der Linden does a little bit of both by first 
measuring the stylistic accuracy of his system for a 
very restricted sub-domain, and then measuring how 
it degrades for a more general domain. 
In itself, measuring the accuracy and coverage of 
a particular implementation in the sub-domain for 
which it was designed brings little insights about 
what generation approach should be adopted in fu- 
ture work. Indeed, even a system using mere canned 
text can be very accurate and attain substantial cov- 
erage if enough hand-coding effort is put into it. 
However, all this effort will have to be entirely du- 
plicated each time the system is scaled up or ported 
to a new domain. Measuring how much of this effort 
duplication can be avoided when relying on revision- 
based generation was the very object of the three 
evaluations carried in the STREAK project. 
16This imperfect match on the acquisition corpus 
seems to result from the heuristic nature of IMAGENE's 
stylistic preferences: individually, none of them needs to 
apply to the whole corpus. 
213 
In terms of methodology, the main originality of 
these three evaluations is the use of CREP to par- 
tially automate reverse engineering of corpus sen- 
tences. Beyond evaluation, CREP is a simple, but 
general and very handy tool that should prove use- 
ful to speed-up a wide range of corpora analyses. 
7 Conclusion 
In this paper, I presented a quantitative evaluation 
of the portability to the stock market domain of the 
revision rule hierarchy used by the system STREAK to 
incrementally generate newswire sports summaries. 
The evaluation procedure consists of searching a test 
corpus of stock market reports for sentence pairs 
whose (semantic and syntactic) structures respec- 
tively match the triggering condition and application 
result of each revision rule. The results show that at 
least 59% of all rule classes are fully portable, with 
at least another 7% partially portable. 
Since the sports domain is not closer to the finan- 
cial domain than to other quantitative domains such 
as meteorology, demography, business auditing or 
computer surveillance, these results are very encour- 
aging with respect to the general cross-domain re- 
usability potential of the knowledge structures used 
in revision-based generation. However, the present 
evaluation concerned only one type of such knowl- 
edge structures: revision rules. In future work, sim- 
ilar evaluations will be needed for the other types of 
knowledge structures: content selection rules, phrase 
planning rules and lexicalization rules. 
Acknowledgements 
Many thanks to Kathy McKeown for stressing the im- 
portance of empirically evaluating STREAK. The re- 
search presented in this paper is currently supported by 
CNPq (Brazilian Research Council) under post-doctoral 
research grant 150130-95.3. It started out while I was 
at Columbia University supported by of a joint grant 
from the Office of Naval Research, by the Advanced 
Research Projects Agency under contract N00014-89-J- 
1782, by National Science Foundation Grants IRT-84- 
51438 and GER-90-2406, and by the New York State 
Science and Technology Foundation under this auspices 
of the Columbia University CAT in High Performance 
Computing and Communications in Health Care, a New 
York State Center for Advanced Technology. 

References 
Duford, D. 1993. caEP: a regular expression- 
matching textual corpus tool. Technical Report 
CU-CS-005-93. Computer Science Department, 
Columbia University, New York. 
Elhadad, M. and Robin, J. 1996. An overview 
of SURGE: a re-usable comprehensive syntactic 
realization component. Technical Report 96-03. 
Computer Science and Mathematics Department, 
Ben Gurion University, Beer Sheva, Israel. 
Kukich, K. 1983. Knowledge-based report genera- 
tion: a knowledge engineering approach to natural 
language report generation. PhD. Thesis. Depart- 
ment of Information Science. University of Pitts- 
burgh. 
Lester, J.C. 1993. Generating natural language 
explanations from large-scale knowledge bases. 
PhD. Thesis. Computer Science Department, 
University of Texas at Austin. 
Mann, W.C. and Matthiessen, C. M. 1983. NIGEL: 
a systemic grammar for text generation. Research 
Report RR-83-105. ISI. Marina Del Rey, CA. 
Robin, J. and McKeown, K.R. 1993. Corpus anal- 
ysis for revision-based generation of complex sen- 
tences. In Proceedings of the 11th National Con- 
ference on Artificial Intelligence, Washington DC. 
(AAAI'93). 
Robin, J. and McKeown, K.R. 1995. Empirically 
designing and evaluating a new revision-based 
model for summary generation. Artificial Intel- 
ligence. Vol 85. Special Issue on Empirical Meth- 
ods. North-Holland. 
Robin, J. 1993. A revision-based generation archi- 
tecture for reporting facts in their historical con- 
text. In New Concepts in Natural Language Gen- 
eration: Planning, Realization and System. Ho- 
racek, H. and Zock, M., Eds. Frances Pinter. 
Robin, J. 1994a. Automatic generation and revision 
of natural language summaries providing histori- 
cal background In Proceedings of the 11th Brazil- 
ian Symposium on Artificial Intelligence, Fort- 
aleza, Brazil. (SBIA'94). 
Robin, J. 1994b. Revision-based generation of natu- 
ral language summaries providing historical back- 
ground: corpus-based analysis, design, implemen- 
tation and evaluation. PhD. Thesis. Available 
as Technical Report CU-CS-034-94. Computer 
Science Department, Columbia University, New 
York. 
Van der Linden, K. and Martin, J.H. 1995. Ex- 
pressing rhetorical relations in instructional texts: 
a case study of the purpose relation. Computa- 
tional Linguistics, 21(1). MIT Press. 
