<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1601">
<Title>Statistical Generation: Three Methods Compared and Evaluated</Title>
<Section position="6" start_page="0" end_page="0" type="concl">
<SectionTitle> 5 Discussion and Further Research </SectionTitle>
<Paragraph position="0"> The n-gram model's bias towards shorter strings is an instance of a general phenomenon: whenever the likelihood of a larger unit of variable length (e.g. a sentence) is modelled as the joint probability of length-invariant smaller units, larger units composed of fewer smaller units are more likely.</Paragraph>
<Paragraph position="1"> The possibility of counteracting this bias has been investigated. In parsing, Magerman &amp; Marcus [1991] and Briscoe &amp; Carroll [1993] used the geometric mean of probabilities instead of their product. In other work, Briscoe &amp; Carroll [1992] used a normalisation approach, the equivalent of which for n-gram selection would be to 'pad' shorter alternatives with extra 'dummy' words up to the length of the longest alternative, assigning each dummy word the geometric mean of the probabilities the n-gram model assigns to the non-dummy words. It has been observed that such methods turn a principled probabilistic model into an ad hoc scoring function [Manning and Schütze, 1999, p. 443]. They certainly discard the n-gram model's particular independence assumptions without replacing them with a principled alternative.</Paragraph>
<Paragraph position="2"> The NLG community is traditionally wary of evaluation metrics like the ones used here. NLU, and in particular the numbers-driven parsing community, has not tended to worry about the fact that the gold-standard parse is usually not the only correct one. It is accepted that a gold standard is an imperfect way of evaluating results, albeit the only realistic one, especially for large amounts of data.
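The length-normalisation idea above (geometric mean of per-word probabilities in place of their product) can be sketched as follows. This is a minimal illustration: the candidate strings, per-word probabilities, and function names are invented for the example and do not come from the paper's experiments.

```python
import math

# Hypothetical per-word probabilities assigned by an n-gram model to two
# candidate strings of different lengths (illustrative numbers only).
short_candidate = [0.20, 0.20, 0.20]              # 3 words
long_candidate = [0.25, 0.25, 0.25, 0.25, 0.25]   # 5 words, each word MORE likely

def product_score(word_probs):
    """Standard n-gram joint probability: the product of per-word probabilities."""
    return math.prod(word_probs)

def geometric_mean_score(word_probs):
    """Length-normalised score in the style of Magerman & Marcus [1991]:
    the geometric mean of the per-word probabilities."""
    return math.prod(word_probs) ** (1.0 / len(word_probs))

# Under the raw joint probability, the shorter string wins simply because
# it contains fewer factors smaller than 1 ...
assert product_score(short_candidate) > product_score(long_candidate)

# ... whereas the geometric mean ranks candidates by average per-word
# probability, so the longer but per-word-more-likely string wins.
assert geometric_mean_score(long_candidate) > geometric_mean_score(short_candidate)
```

This also illustrates the objection cited above: the geometric mean is no longer a probability of the string under the model, only a score.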
Furthermore, where a parser or generator has been trained directly on a corpus, it is a fair way of estimating how well a method has succeeded in its aim, namely to learn from the corpus.</Paragraph>
<Paragraph position="3"> Some typical example outputs from all three generators are shown in an appendix to this paper. The greedy and Viterbi generators tend to produce very similar output. Both rely on a small number of short, high-frequency phrases, e.g. SOON, VEERING, INCREASING, and LATER. The 2-gram generator has a similar tendency, but draws on a larger number of phrases and shows more variation. If the generation rules allow it, the greedy generator always begins its output with SOON (all three examples), whereas the Viterbi generator can avoid it (third example).</Paragraph>
<Paragraph position="4"> The quality of the 2-gram generator's output is independent of the quality of the weather-forecast generation rules, as long as the rules reflect all the variation in the corpus. The other two generators, however, depend entirely on the quality of the generation rules, in particular on whether the rules incorporate enough information to base decisions on. The only room for improving the 2-gram generator lies in alternative statistical estimation techniques, but there is considerable potential for improving the probabilistic-rule-driven generators, by (i) improving the quality of the generation rules (e.g. by incorporating more contextual information, in particular always including a time stamp); (ii) using alternative methods for building treebank models (e.g.</Paragraph>
<Paragraph position="5"> extending the local context of rules has proved useful in parsing); and (iii) using alternative methods for exploiting treebank models during generation, e.g.
it would be good to have a generation strategy that retains (some of) the vastly superior efficiency of the greedy generator without its repetitiveness, while not sacrificing (too much of) the overall likelihood of making the right decision (unlike the non-uniform random strategy discussed in Section 3.1). Future research will look at all three areas, using for evaluation both the SUMTIME corpus and a larger, more varied corpus from a different domain.</Paragraph>
<Paragraph position="6"> The three statistical generation methods evaluated in this paper all work with raw corpora, but vary hugely in efficiency.</Paragraph>
<Paragraph position="7"> Generation speed and the ability to adapt generators to new domains without an annotation bottleneck are crucial for the development of practical, generic NLG tools. The work presented in this paper is intended as a step in this direction.</Paragraph>
</Section>
</Paper>
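The greedy/Viterbi contrast discussed above can be sketched with a toy bigram model in which the locally most probable first word (here SOON, echoing the paper's example phrases) does not begin the globally most probable string. The model, its probabilities, and the code are hypothetical illustrations, not the paper's implementation.

```python
# Toy bigram model: P(next | previous). Probabilities are invented so that
# SOON is the best FIRST step but not part of the best WHOLE string.
BIGRAM = {
    "<s>":     {"SOON": 0.6, "LATER": 0.4},
    "SOON":    {"VEERING": 0.4, "</s>": 0.6},
    "LATER":   {"VEERING": 0.95, "</s>": 0.05},
    "VEERING": {"</s>": 1.0},
}

def greedy_generate(model):
    """Greedy strategy: always take the single most probable next word."""
    words, prev = [], "<s>"
    while prev != "</s>":
        prev = max(model[prev], key=model[prev].get)
        words.append(prev)
    return words[:-1]  # drop the end symbol

def viterbi_generate(model, max_len=10):
    """Viterbi-style dynamic programming over (position, last-word) states:
    returns the globally most probable string up to max_len words."""
    # best[w] = (probability, word sequence) of the best partial string ending in w
    best = {"<s>": (1.0, [])}
    complete = (0.0, [])
    for _ in range(max_len):
        nxt = {}
        for prev, (p, seq) in best.items():
            for w, pw in model.get(prev, {}).items():
                cand = (p * pw, seq + [w])
                if w == "</s>":
                    complete = max(complete, cand)
                elif cand[0] > nxt.get(w, (0.0, []))[0]:
                    nxt[w] = cand
        best = nxt
    prob, seq = complete
    return seq[:-1], prob

greedy = greedy_generate(BIGRAM)       # ['SOON'], probability 0.6 * 0.6 = 0.36
viterbi, p = viterbi_generate(BIGRAM)  # ['LATER', 'VEERING'], probability 0.4 * 0.95 * 1.0 = 0.38
assert greedy == ["SOON"]
assert viterbi == ["LATER", "VEERING"]
```

The greedy generator is far cheaper (one argmax per word) but commits to SOON; the Viterbi search, at higher cost, recovers the more probable LATER VEERING, mirroring the behaviour reported for the third example.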