<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1028">
  <Title>Evaluating the Portability of Revision Rules for Incremental Summary Generation</Title>
  <Section position="3" start_page="0" end_page="205" type="metho">
    <SectionTitle>
2 An overview of STREAK
</SectionTitle>
    <Paragraph position="0"> The project STREAK was initially motivated by analyzing a corpus of newswire summaries written by professional sportswriters 2. This analysis revealed four characteristics of summaries that challenge the capabilities of previous text generators: concise linguistic forms, complex sentences, optional and background facts opportunistically slipped as modifiers of obligatory facts and high paraphrasing power. By greatly increasing the number of content planning and linguistic realization options that the generator must consider, as well as the mutual constraints among them, these characteristics make generating summaries in a single pass impractical.</Paragraph>
    <Paragraph position="1"> The example run given in Fig. 1 illustrates how STREAK overcomes these difficulties. It first generates a simple draft sentence that contains only the obligatory facts to include in any game report (location, date, game result and key player statistic).</Paragraph>
    <Paragraph position="2"> It then applies a series of revision rules 3, each one  1. Initial draft (basic sentence pattern): &amp;quot;Dallas, TX - Charles Barkley scored 42 points Sunday as the Phoenix Suns defeated the Dallas Mavericks 123-97.&amp;quot; 2. Adjunctization of Created into Instrument: &amp;quot;Dallas, TX - Charles Barkley tied a season high wlth 42 points Sunday as the Phoenix Suns defeated the Dallas Mavericks 123-97.&amp;quot; 3. Coordinative Conjoin of Clause: &amp;quot;Dallas, TX - Charles Barkley tied a season high with 42 points and Danny A|nge added 21 Sunday as the Phoenix Suns defeated the Dallas Mavericks 123-97.&amp;quot; 4. Absorb of Clause in Clause as Result with Agent Control: &amp;quot;Dallas, TX - Charles Barkley tied a season high with 42 points and Danny Ainge came oIT the bench to add 21 Sunday as the Phoenix Suns defeated the Dallas Mavericks 123-97.&amp;quot; 5. l~ominalization with 0rdinal Adjoin: &amp;quot;Dallas, TX - Charles Barldey tied a season high with 42 points and Danny Ainge came off the bench to add 21 Sunday as the Phoenix Suns handed the Dallas Mavericks their 13th straight home defeat 123-97.&amp;quot; 6. Adjoin of Classifier to NP: &amp;quot;Dallas, TX - Charles Barkley tied a season high with 42 points and Danny Ainge came off the bench to add 21 Sunday as the Phoenix Suns handed the Dallas Mavericks their league worst 13th straight home defeat  words that get modified are underlined.</Paragraph>
    <Paragraph position="3"> Charles Barldey scored 42 points. Those 42 points equal his best scoring performance of the season. Danny Ainge is a teammate of Barkley. They play for the Phoenix Suns. Ainge is a reserve player. Yet he scored 21 points.</Paragraph>
    <Paragraph position="4"> The high scoring performances by Barkley and Ainge helped the Suns defeat the Dallas Mavericks. The Mavericks played on their homecourt in Texas. They had already lost their 12 previous games there. No other team in the league has lost so many games in a row at home. The final score was 123-97. The game was played Sunday.</Paragraph>
    <Paragraph position="5">  paraphrasing a single complex sentence sentence reaches linguistic complexity limits empiricMly observed in the corpus (e.g., 50 word long or parse tree of depth 10).</Paragraph>
    <Paragraph position="6"> While STREAK generates only single sentences, those complex sentences convey as much information as whole paragraphs made of simple sentences, only far more fluently and concisely. This is illustrated by the 12 sentence paragraph 6 of Fig. 2, which paraphrases sentence 6 of Fig. 1. Because they express facts essentially independently of one another, such multi-sentence paragraphs are much easier to generate than the complex single sentences generated by STREAK.</Paragraph>
  </Section>
  <Section position="4" start_page="205" end_page="206" type="metho">
    <SectionTitle>
3 Acquiring revision rules from corpus data
</SectionTitle>
    <Paragraph position="0"> corpus data The rules driving the revision process in STREAK were acquired by reverse engineering 7 about 300 corpus sentences. These sentences were initially classified in terms of:  shown here only for contrastive purposes. v i.e., analyzing how they could be incrementally generated through gradual revisions.</Paragraph>
    <Paragraph position="1">  The resulting classes, called realization patterns, abstract the mapping from semantic to syntactic structure by factoring out lexical material and syntactic details. Two examples of realization patterns are given in Fig. 3. Realization patterns were then grouped into surface decrement pairs consisting of: * A more complex pattern (called the target pattern). null * A simpler pattern (called the source pattern) that is structurally the closest to the target pattern among patterns with one less concept s . The structural transformations from source to target pattern in each surface decrement pair were then hierarchically classified, resulting in the revision rule hierarchy shown in Fig. 4-10. For example, the surface decrement pair &lt; R~, R 1 &gt;, shown in Fig. 3, is one of the pairs from which the revision rule Adjunctization of Range into Instrument, shown in Fig. 10 was abstracted.</Paragraph>
    <Paragraph position="2"> It involves displacing the Range argument of the source clause as an Instrument adjunct to accommodate a new verb and its argument. This revision rule is a sibling of the rule Adjunctization of Created into Instrument used to revise sentence i into 2 in STREAK'S run shown in Fig. 1 (where the Created argument role &amp;quot;42 points&amp;quot; of the verb &amp;quot;to score&amp;quot; in I becomes an Instrument adjunct in 2). The bottom level of the revision rule hierarchy specifies the side revisions that are orthogonal and sometimes accompany the restructuring revisions discussed up to this point. Side revisions do not make the draft more informative, but instead improve its style, conciseness and unambiguity. For example, when STREAK revises sentence (3) into (4) in the example run of Fig. 1, the Agent of the absorbed clause &amp;quot;Danny Ainge added 21 points&amp;quot; becomes controlled by the new embedding clause &amp;quot;Danny Ainge came off the bench&amp;quot; to avoid the verbose form: ? &amp;quot;Danny Ainge came off the bench for Danny Ainge to add 21 points&amp;quot;.</Paragraph>
  </Section>
  <Section position="5" start_page="206" end_page="209" type="metho">
    <SectionTitle>
4 Evaluation methodology
</SectionTitle>
    <Paragraph position="0"> In the spectrum of possible evaluations, the evaluation presented in this paper is characterized as follows: * Its object is the revision rule hierarchy acquired from the sports summary corpus. It thus does not directly evaluate the output of STREAK, but rather the special knowledge structures required by its underlying revision-based model.</Paragraph>
    <Paragraph position="1"> s i.e., the source pattern expresses the same concept combination than the target pattern minus one concept. The particular property of this revision rule hierarchy that is evaluated is cross-domain portability: how much of it could be re-used to generate summaries in another domain, namely the stock market? The basis for this evaluation is corpus data 9.</Paragraph>
    <Paragraph position="2"> The original sports summary corpus from which the revision rules were acquired is used as the 'training' (or acquisition) corpus and a corpus of stock market reports taken from several newswires is used as the 'test' corpus. This test corpus comprises over 18,000 sentences.</Paragraph>
    <Paragraph position="3"> The evaluation procedure is quantitative, measuring percentages of revision rules whose target and source realization patterns are observable in the test corpus. It is also semi-automated through the use of the corpus search tool CREP (Duford, 1993) (as explained below).</Paragraph>
    <Paragraph position="4"> Basic principle As explained in section 3, a revision rule is associated with a list of surface decrement pairs, each one consisting of: A source pattern whose content and linguistic form match the triggering conditions of the rule (e.g., R~ in Fig. 3 for the rule Adjunctization of Range into Instrument).</Paragraph>
    <Paragraph position="5"> A target pattern whose content and linguistic form can be derived from the source pattern by applying the rule (e.g., R 2 in Fig. 3 for the rule Adjunctization of Range into Instrument).</Paragraph>
    <Paragraph position="6"> This list of decrement pairs can thus be used as the signature of the revision rule to detect its usage in the test corpus. The needed evidence is the simultaneous presence of two test corpus sentences 1deg , each one respectively matching the source and target patterns of at least one element in this list. Requiring occurrence of the source pattern in the test corpus is necessary for the computation of conservative portability estimates: while it may seem that one target pattern alone is enough evidence, without the presence of the corresponding source pattern, one cannot rule out the possibility that, in the test domain, this target pattern is either a basic pattern or derived from another source pattern using another revision rule.</Paragraph>
    <Paragraph position="7"> 9Only the corpus analysis was performed for both domains. The implementation was not actually ported to the stock market domain.</Paragraph>
    <Paragraph position="8">  Partially automating the evaluation The software tool CREP 11 was developed to partially automate detection of realization patterns in a text corpus. The basic idea behind CREP is to approximate a realization pattern by a regular expression whose terminals are words or parts-of-speech tags (POStags). CR~.P will then automatically retrieve the corpus sentences matching those expressions. For example, the CREP expression C~1 below approximates the realization pattern R~ shown in Fig. 3:</Paragraph>
    <Paragraph position="10"> In the expression above, 'VBD', 'NN' and 'IN' are the POS-tags for past verb, singular noun and preposition (respectively), and the sub-expressions 'TEAH' and 'SCORE' (whose recursive definitions are not shown here) match the team names and possible final scores (respectively) in the NBA. The CREP operators 'N=' and 'N-' (N being an arbitrary integer) respectively specify exact and minimal distance of N words, and 'l' encodes disjunction.</Paragraph>
    <Paragraph position="11"> llcREP was implemented (on top of FLEX, GNUS' version of LEX) and to a large extent also designed by Duford. It uses Ken Church's POS tagger.</Paragraph>
    <Paragraph position="12"> Because a realization pattern abstracts away from lexical items to capture the mapping from concepts to syntactic structure, approximating such a pattern by a regular expression of words and POS-tags involves encoding each concept of the pattern by the disjunction of its alternative lexicalizations. In a given domain, there are therefore two sources of inaccuracy for such an approximation: * Lexical ambiguity resulting in false positives by over-generalization.</Paragraph>
    <Paragraph position="13"> * Incomplete vocabulary resulting in false negatives by over-specialization 12.</Paragraph>
    <Paragraph position="14"> Lexical ambiguities can be alleviated by writing more context-sensitive expressions. The vocabulary can be acquired through additional exploratory CREP runs with expressions containing wild-cards for some concept slots. Although automated corpus search using CREP expressions considerably speedsup corpus analysis, manual intervention remains 12This is the case for example of C1 above, which is a simplification of the actual expression that was used to search occurrences of R~ in the test corpus (e.g., Cz is missing &amp;quot;win&amp;quot; and &amp;quot;rout&amp;quot; as alternatives for &amp;quot;victory&amp;quot;).  necessary to filter out incorrect matches resulting from imperfect approximations.</Paragraph>
    <Paragraph position="15"> Cross-domain discrepancies Basic similarities between the finance and sports domains form the basis for the portability of the revision rules. In both domains, the core facts reported are statistics compiled within a standard temporal unit (in sports, one ballgame; in finance, one stock market session) together with streaks 13 and records compiled across several such units. This correspondence is, however, imperfect. Consequently, before they can track down usage of a revision rule in the test domain, the CREP expressions approximating the signature of the rule in the acquisition domain must be adjusted for cross-domain discrepancies to prevent false negatives. Two major types of adjustments are necessary: lexical and thematic.</Paragraph>
    <Paragraph position="16"> Lexical adjustments handle cases of partial mis-match between the respective vocabularies used to lexicalize matching conceptual structures in each domain. (e.g.,, the verb &amp;quot;to rebound from&amp;quot; expresses the interruption of a streak in the stock market domain, while in the basketball domain &amp;quot;to break&amp;quot; or &amp;quot;to snap&amp;quot; are preferred since &amp;quot;to rebound&amp;quot; is used to express a different concept).</Paragraph>
    <Paragraph position="17"> Thematic adjustments handle cases of partial differences between corresponding conceptual structures in the acquisition and test domains. For example, while in sports garae-result involves antagonistic teams, its financial domain counterpart session-result concerns only a single indicator.</Paragraph>
    <Paragraph position="18"> Consequently, the sub-expression for the loser role in the example CI:tEP expression (~1 shown before, and which approximates realization pattern /~ for game-resull; (shown in Fig. 3), needs to become optional in order to also approximate patterns for session-resul~. This is done using the CREP operator ? as shown below:  Note that it is the CREP expressions used to automatically retrieve test corpus sentence pairs attesting usage of a revision rule that require this type of adjustment and not the revision rule itself 14. For example, the Adjoin of Frequency PP to Clause revision rule attaches a streak to a session-result clause without loser role in exactly the same way than it attaches a streak to a game-result with 13i.e., series of events with similar outcome.</Paragraph>
    <Paragraph position="19"> 14Some revision rules do require adjustment, but of another type (cfl Sect. 5).</Paragraph>
    <Paragraph position="20"> loser role. This is illustrated by the two corpus sentences below: P~: &amp;quot;The Chicago Bulls beat the Phoenix Suns 99 91 for their 3rd straight win&amp;quot; pt: &amp;quot;The Amex Market Value Index inched up 0.16 to 481.94 for its sixth straight advance&amp;quot; Detailed evaluation procedure The overall procedure to test portability of a revision rule consists of considering the surface decrement pairs in the rule signature in order, and repeating the following steps: 1. Write a CREP expression for the acquisition target pattern.</Paragraph>
    <Paragraph position="21"> 2. Iteratively delete, replace or generalize sub-expressions in the CREP expression - to gloss over thematic and lexical discrepancies between the acquisition and test domains, and prevent false negatives - until it matches some test corpus sentence(s).</Paragraph>
    <Paragraph position="22"> 3. Post-edit the file containing these matched sentences. If it contains only false positives of the sought target pattern, go back to step 2. Otherwise, proceed to step 4.</Paragraph>
    <Paragraph position="23"> 4. Repeat step (1-3) with the source pattern of the pair under consideration. If a valid match can  also be found for this source pattern, stop: the revision rule is portable. Otherwise, start over from step 1 with the next surface decrement pair in the revision rule signature. If there is no next pair left, stop: the revision rule is considered non-portable.</Paragraph>
    <Paragraph position="24"> Steps (2,3) constitute a general, generate-and-test procedure to detect realization patterns usage in a corpus 15. Changing one CKEP sub-expression may result in going from too specific an expression with no valid match to either: (1) a well-adjusted expression with a valid match, (2) still too specific an expression with no valid match, or (3) already too general an expression with too many matches to be manually post-edited.</Paragraph>
    <Paragraph position="25"> It is in fact always possible to write more context-sensitive expressions, to manually edit larger nomatch files, or even to consider larger test corpora in the hope of finding a match. At some point however, one has to estimate, guided by the results of previous runs, that the likelihood of finding a match is too 15And since most generators rely on knowledge structures equivalent to realization patterns, this procedure can probably be adapted to semi-automatically evaluate the portability of virtually any corpus-based generator.  small to justify the cost of further attempts. This is why the last line in the algorithm reads &amp;quot;considered non-portable&amp;quot; as opposed to &amp;quot;non-portable&amp;quot;. The algorithm guarantees the validity of positive (i.e., portable) results only. Therefore, the figures presented in the next section constitute in fact a lower-bound estimate of the actual revision rule portability.</Paragraph>
  </Section>
  <Section position="6" start_page="209" end_page="209" type="metho">
    <SectionTitle>
5 Evaluation results
</SectionTitle>
    <Paragraph position="0"> The results of the evaluation are summarized in Fig. 4-10. They show the revision rule hierarchy, with portable classes highlighted in bold. The frequency of occurrence of each rule in the acquisition corpus is given below the leaves of the hierarchy.</Paragraph>
    <Paragraph position="1"> Some rules are same-concept portable: they are used to attach corresponding concepts in each domain (e.g., Adjoin of Frequency PP to Clause, as explained in Sect. 4). They could be re-used &amp;quot;as is&amp;quot; in the financial domain. Other rules, however, are only different-concept portable: they are used to attach altogether different concepts in each domain.</Paragraph>
    <Paragraph position="2"> This is the case for example of Adjoin Finite Time Clause to Clause, as illustrated by the two corpus sentences below, where the added temporal adjunct (in bold) conveys a streak in the sports sentence, but a complementary statistics in the financial one: T~: &amp;quot;to lead Utah to a 119-89 trouncing of Denver as the Jazz defeated the Nuggets for the 12th straight time at home.&amp;quot; T~: &amp;quot;Volume amounted to a solid 349 million shares as advances out-paced declines 299 to 218.&amp;quot;.</Paragraph>
    <Paragraph position="3"> For different-concept portable rules, the left-hand side field specifying the concepts incorporable to the draft using this rule will need to be changed when porting the rule to the stock market domain. In Fig. 4-10, the arcs leading same-concept portable classes are full and thick, those leading to different-concept portable classes are dotted, and those leading to a non-portable classes are full but thin.</Paragraph>
    <Paragraph position="4"> 59% of all revision rule classes turned out to be same-concept portable, with another 7% different-concept portable. Remarkably, all eight top-level classes identified in the sports domain had instances same-concept portable to the financial domain, even those involving the most complex non-monotonic revisions, or those with only a few instances in the sports corpus. Among the bottom-level classes that distinguish between revision rule applications in very specific semantic and syntactic contexts, 42% are same-concept portable with another 10% different-concept portable. Finally, the correlation between high usage frequency in the acquisition corpus and portability to the test corpus is not statistically significant (i.e., the hypothesis that the more common a rule, the more likely it is to be portable could not be confirmed on the analyzed sample). See (Robin, 1994b) for further details on the evMuation results.</Paragraph>
    <Paragraph position="5"> There are two main stumbling blocks to portability: thematic role mismatch and side revisions.</Paragraph>
    <Paragraph position="6"> Thematic role mismatches are cases where the semantic label or syntactic sub-category of a constituent added or displaced by the rule differ in each domain (e.g., Adjunctization of Created into Instrument vs. Adjoin of Affected into Instrument). They push portability from 92% down to 71%. Their effect could be reduced by allowing STREAK'S reviser to manipulate the draft down to the surface syntactic role level (e.g., in both corpora Created and Affected surface as object). Currently, the reviser stops at the thematic role level to allow STREAK to take full advantage of the syntactic processing front-end SURGE (Elhadad and Robin, 1996), which accepts such thematic structures as input. null Accompanying side revisions push portability from 71% to 52%. This suggests that the design of STREAK could be improved by keeping side revisions separate from re-structuring revisions and interleaving the applications of the two. Currently, they are integrated together at the bottom of the revision rule hierarchy.</Paragraph>
  </Section>
  <Section position="7" start_page="209" end_page="213" type="metho">
    <SectionTitle>
6 Related work
</SectionTitle>
    <Paragraph position="0"> Apart from STREAK, only three generation projects feature an empirical and quantitative evaluation: ANA (Kukich, 1983), KNIGHT (Lester, 1993) and IMAGENE (Van der Linden, 1993).</Paragraph>
    <Paragraph position="1"> ANA generates short, newswire style summaries of the daily fluctuations of several stock market indexes from half-hourly updates of their values. For evaluation, Kukich measures both the conceptual and linguistic (lexical and syntactic) coverages of ANA by comparing the number of concepts and realization patterns identified during a corpus analysis with those actually implemented in the system.</Paragraph>
    <Paragraph position="2"> KNIGHT generates natural language concept definitions from a large biological knowledge base, relying on SURGE for syntactic realization. For evaluation, Lester performs a Turing test in which a panel of human judges rates 120 sample definitions by assigning grades (from A to F) for: * Semantic accuracy (defined as &amp;quot;Is the definition adequate, providing correct information and focusing on what's important?&amp;quot; in the instructions provided to the judges).</Paragraph>
    <Paragraph position="3"> * Stylistic accuracy (defined as &amp;quot;Does the definition use good prose and is the information it</Paragraph>
    <Section position="1" start_page="210" end_page="213" type="sub_section">
      <SectionTitle>
Recast
</SectionTitle>
      <Paragraph position="0"> of NP of clause from classifier from location from range to qualifier to instrument to time to instrument  conveys well organized&amp;quot; in the instructions provided to the judges).</Paragraph>
      <Paragraph position="1"> The judges did not know that half the definitions were computer-generated while the other half were written by four human domain experts. Impressively, the results show that: * With respect to semantic accuracy, the human judges could not tell KNIGHT apart from the human writers.</Paragraph>
      <Paragraph position="2"> * While as a group, humans got statistically significantly better grades for stylistic accuracy than KNIGHT, the best human writer was singlehandly responsible for this difference. IMAGENE generates instructions on how to operate household devices relying on NIGEL (Mann and Matthiessen, 1983) for syntactic realization. The implementation focuses on a very limited aspect of text generation: the realization of purpose relations. Taking as input the description of a pair &lt;operation, purpose of the operation&gt;, augmented by a set of features simulating the communicative context of generation, IMAGENE selects, among the many realizations of purpose generable by NIGEL (e.g., fronted to-infinitive clause vs. trailing for-gerund clauses), the one that is most appropriate for the simulated context (e.g., in the context of several operations sharing the same purpose, the latter is preferentially expressed before those actions than after them). IMAGENE's contextual preference rules were abstracted by analyzing an acquisition corpus of about 300 purpose clauses from cordless telephone manuMs. For evaluation, Van der Linden compares the purpose realizations picked by IMAGENE to the one in the corresponding corpus text, first on the acquisition corpus and then on a test corpus of about 300 other purpose clauses from manuals for other devices than cordless telephones (ranging from clock radio to automobile). The results show a 71% match on the acquisition corpus 16 and a 52% match on the test corpus.</Paragraph>
      <Paragraph position="3"> The table of Fig. 11 summarizes the difference on both goal and methodology between the evaluations carried out in the projects ANA, KNIGHT, IMAGENE and STREAK. In terms of goals, while Kukich and Lester evaluate the coverage or accuracy of a particular implementation, I instead focus on three properties inherent to the use of the revision-based generation model underlying STREAK: robustness (how much of other text samples from the same domain can be generated without acquiring new knowledge?) and scalability (how much more new knowledge is needed to fully cover these other samples?) discussed in (Robin and McKeown, 1995), and portability to another domain in the present paper. Van der Linden does a little bit of both by first measuring the stylistic accuracy of his system for a very restricted sub-domain, and then measuring how it degrades for a more general domain.</Paragraph>
      <Paragraph position="4"> In itself, measuring the accuracy and coverage of a particular implementation in the sub-domain for which it was designed brings little insights about what generation approach should be adopted in future work. Indeed, even a system using mere canned text can be very accurate and attain substantial coverage if enough hand-coding effort is put into it. However, all this effort will have to be entirely duplicated each time the system is scaled up or ported to a new domain. Measuring how much of this effort duplication can be avoided when relying on revision-based generation was the very object of the three evaluations carried in the STREAK project.</Paragraph>
      <Paragraph position="5"> 16This imperfect match on the acquisition corpus seems to result from the heuristic nature of IMAGENE's stylistic preferences: individually, none of them needs to apply to the whole corpus.</Paragraph>
      <Paragraph position="6">  In terms of methodology, the main originality of these three evaluations is the use of CREP to partially automate reverse engineering of corpus sentences. Beyond evaluation, CREP is a simple, but general and very handy tool that should prove useful to speed-up a wide range of corpora analyses.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>