<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1008"> <Title>Empirically-based Control of Natural Language Generation</Title> <Section position="4" start_page="58" end_page="58" type="metho"> <SectionTitle> 2 Overview of the Approach </SectionTitle> <Paragraph position="0"> Our overall approach has two phases: (1) offline calculation of the control parameters, and (2) online application to generation. In the first phase we determine a set of correlation equations, which capture the relationship between surface linguistic features of generated texts and the internal generator decisions that gave rise to those texts (see figure 1). In the second phase, these correlations are used to guide the generator to produce texts with particular surface feature characteristics (see figure 2).</Paragraph> <Paragraph position="1"> The starting point is a corpus of texts which represents all the variability that we wish to capture. Counts for (surface) linguistic features from the texts in the corpus are obtained, and a factor analysis is used to establish dimensions of variation in terms of these counts: each dimension is defined by a weighted sum of scores for particular features, and factor analysis determines the combination that best accounts for the variability across the whole corpus. This provides a language variation model which can be used to score a new text along each of the identified dimensions, that is, to locate the text in the variation space determined by the corpus.</Paragraph> <Paragraph position="2"> The next step is to take a generator which can generate across the range of variation in the corpus, and identify within it the key choice points (CP1, CP2, ... CPn) in its generation of a text. We then allow the generator to freely generate all possible texts from one or more inputs. For each text so generated we record (a) the text's score according to the variation model and (b) the set of decisions made at each of the selected choice points in the generator. Finally, for a random sample of the generated texts, a statistical correlation analysis is undertaken between the scores and the corresponding generator decisions, resulting in correlation equations which predict likely variation scores from generator decisions.</Paragraph> <Paragraph position="3"> In the second phase, the generator is adapted to use the correlation equations to conduct a best-first search of the generation space. As well as the usual input, the generator is supplied with target scores for each dimension of variation. At each choice point, the correlation equations are used to predict which choice is most likely to move closer to the target score for the final text.</Paragraph> <Paragraph position="4"> This basic architecture makes no commitment to what is meant by 'variation', 'linguistic features', 'generator choice points', or even 'NLG system'.</Paragraph> <Paragraph position="5"> The key ideas are that a statistical analysis of surface features of a corpus of texts can be used to define a model of variation; this model can then be used to control a generator; and the model can also be used to evaluate the generator's performance. 
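By way of illustration, the offline phase can be sketched as follows. This is not the authors' code: the feature set and counts are invented, and scikit-learn's FactorAnalysis is assumed purely as a stand-in for the factor analysis actually used.

```python
# Illustrative sketch of the offline phase: derive dimensions of variation from
# per-text counts of surface features, then score a new text on those dimensions.
# Feature names and counts are invented; scikit-learn is assumed to be available.

import numpy as np
from sklearn.decomposition import FactorAnalysis

FEATURES = ["pronouns", "passives", "auxiliaries", "imperatives", "definite_NPs"]

# One row per corpus text, one column per surface feature count (toy data).
counts = np.array([
    [12, 1, 5, 4, 3],
    [2, 9, 1, 0, 11],
    [8, 2, 6, 5, 2],
    [1, 7, 2, 0, 9],
    [10, 0, 4, 6, 1],
    [3, 8, 1, 1, 10],
], dtype=float)

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(counts)

def score_text(feature_counts):
    """Locate a text in the two-dimensional variation space defined by the corpus."""
    return fa.transform(np.asarray(feature_counts, dtype=float).reshape(1, -1))[0]

print("feature loadings per dimension:\n", np.round(fa.components_, 2))
print("scores for a new text:", np.round(score_text([6, 3, 4, 2, 5]), 2))
```

The weighted sums returned by the transform play the role of the dimension scores used to locate each generated text in the variation space.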
In the next section we describe a concrete instantiation of this architecture, in which 'variation' is stylistic variation as characterised by a collection of shallow lexical and syntactic features.</Paragraph> </Section> <Section position="5" start_page="58" end_page="59" type="metho"> <SectionTitle> 3 An Implemented System </SectionTitle> <Paragraph position="0"> In order to evaluate the effectiveness of this general approach, we implemented a system which attempts to control the style, as defined by Biber (1988), of short generated texts (typically 2-3 sentences) describing medicine dosage instructions.</Paragraph> <Section position="1" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 3.1 Factor Analysis </SectionTitle> <Paragraph position="0"> Biber characterised style in terms of very shallow linguistic features, such as the presence of pronouns, auxiliaries, passives, etc. By using factor analysis techniques he was able to determine complex correlations between the occurrence and non-occurrence of such features in text, which he used to characterise different styles of text.2 We adopted the same basic methodology, applied to a smaller, more consistent corpus of just over 300 texts taken from proprietary patient information leaflets. Starting with around 70 surface linguistic features as variables, our factor analysis yielded two main factors (each containing linguistic features grouped into positively and negatively correlated subgroups), which we used as our dimensions of variation. We interpreted these dimensions as follows (this is a subjective process -- factor analysis does not itself provide any interpretation of factors): dimension 1 ranges from texts that try to involve the reader (high positive score) to texts that try to be distant from the reader (high negative score); dimension 2 ranges from texts with more pronominal reference and a higher proportion of certain verbal forms (high positive score) to texts that use full nominal reference (high negative score).3</Paragraph> <Paragraph position="1"> 2 Some authors (e.g. Lee (1999)) have criticised Biber for making assumptions about the validity and generalisability of his approach to the English language as a whole. Here, however, we use his methodology to characterise whatever variation exists without needing to make any broader claims.</Paragraph> <Paragraph position="2"> 3 Full details of the factor analysis can be found in (Paiva, 2000).</Paragraph> </Section> <Section position="2" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 3.2 Generator Architecture </SectionTitle> <Paragraph position="0"> The generator was constructed from a mixture of existing components and new implementation, using a fairly standard overall architecture as shown in figure 3. Here, dotted lines show the control flow and the straight lines show data flow -- the choice point annotations are described below.</Paragraph> <Paragraph position="1"> The input constructor takes an input specification and, using a background database of medicine information, creates a network of concepts and relations (see figure 4) using a schema-based approach (McKeown, 1985).</Paragraph> <Paragraph position="2"> Each network is then split into subnetworks by the split network module.
This partitions the network by locating 'proposition' objects (marked with a double-lined box in figure 4) which have no parent, and tracing the subnetwork reachable from each one. We call these subnetworks propnets. In figure 4, there are two propnets, rooted in [1:take] and [9:state] -- proposition [15:state] is not a root as it can be reached from [1:take]. A list of all possible groupings of these propnets is obtained,4 and one of the possible combinations is passed to the network ordering module. This is the first source of non-determinism in our system, marked as choice point one in figure 3. A combination of subnetworks will be the material for the realisation of one paragraph, and each subnetwork will be realised as one sentence.</Paragraph> <Paragraph position="3"> 4 For instance, with three propnets (A, B and C) the list of combinations would be [(A,B,C), (A,BC), (AB,C), ...].</Paragraph> <Paragraph position="4"> The network ordering module receives a combination of subnetworks and orders them based on the number of common elements between subnetworks. The strategy is to try to maximise the possibility of having a smooth transition from one sentence to the next, in accordance with Centering Theory (Grosz et al., 1995), and so increase the possibility of having a pronoun generated.</Paragraph> <Paragraph position="5"> The referring expression module receives one subnetwork at a time and decides, for each object that is of type [thing], which type of referring expression will be generated. The module is re-used from the Riches system (Cahill et al., 2001) and it generates either a definite description or a pronoun. This is the second source of non-determinism in our system, marked as choice point two in figure 3. Referring expression decisions are recorded by introducing additional nodes into the network, as shown for example in figure 5 (a fragment of the network in figure 4, with the additional nodes).</Paragraph> <Paragraph position="6"> NP pruning is responsible for erasing from a referring expression subnetwork all the nodes that can be transitively reached from a node marked to be pronominalised. This prevents the realiser from trying to express the same information twice. In figure 5, [7:dose] is marked to be pronominalised, so the concepts [11:of] and [3:medicine] do not need to be realised and are therefore pruned.</Paragraph> <Paragraph position="7"> 5 Although some of the labels in this figure look like words, they bear no direct relation to words in the surface text -- for example, 'of' may be realised as a genitive construction or a possessive.</Paragraph> <Paragraph position="8"> The realiser is a re-implementation of Nicolov's (1999) generator, extended to use the wide-coverage lexicalised grammar developed in the LEXSYS project (Carroll et al., 2000), with further semantic extensions for the present system. It selects grammar rules by matching their semantic patterns to subnetworks of the input, and tries to generate a sentence consuming the whole input. In general there are several rules linking each piece of semantics to its possible realisation, so this is our third, and most prolific, source of non-determinism in the architecture, marked as choice point three in figure 3.</Paragraph>
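The two graph operations behind choice points one and two can be sketched as follows. This is an illustrative reconstruction rather than the authors' code, and the edge list is a simplified, partly invented rendering of the figure 4 network (labels such as 2:patient and 8:gram do not appear in the text).

```python
# Minimal sketch: split a concept network into propnets rooted at parentless
# propositions, and prune everything reachable from a pronominalised node.
# The edges below are a toy, partly invented encoding of the figure 4 network.

EDGES = {
    "1:take": ["2:patient", "7:dose", "15:state"],
    "7:dose": ["11:of"],
    "11:of": ["3:medicine"],
    "9:state": ["7:dose", "8:gram"],
    "15:state": ["7:dose"],
}
PROPOSITIONS = {"1:take", "9:state", "15:state"}

def reachable(root, edges):
    """All nodes reachable from root (including root itself)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen

def propnets(edges, propositions):
    """Propnets: subnetworks reachable from propositions that have no parent."""
    children = {child for kids in edges.values() for child in kids}
    roots = [p for p in sorted(propositions) if p not in children]
    return {root: reachable(root, edges) for root in roots}

def prune_np(edges, pronominalised):
    """Drop every node transitively reachable from a pronominalised node, so the
    realiser does not try to express the same information twice."""
    doomed = reachable(pronominalised, edges) - {pronominalised}
    return {node: [c for c in kids if c not in doomed]
            for node, kids in edges.items() if node not in doomed}

print(propnets(EDGES, PROPOSITIONS))   # two propnets, rooted in 1:take and 9:state
print(prune_np(EDGES, "7:dose"))       # 11:of and 3:medicine are pruned
```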
<Paragraph position="9"> A few examples of outputs for the input represented in figure 4 are: the dose of the patient's medicine is taken twice a day. it is two grams.</Paragraph> <Paragraph position="10"> the two-gram dose of the patient's medicine is taken twice a day.</Paragraph> <Paragraph position="11"> the patient takes the two-gram dose of the patient's medicine twice a day.</Paragraph> <Paragraph position="12"> From a typical input corresponding to 2-3 sentences, this generator will generate over a thousand different texts.</Paragraph> </Section> <Section position="3" start_page="59" end_page="61" type="sub_section"> <SectionTitle> 3.3 Tracing Generator Behaviour </SectionTitle> <Paragraph position="0"> In order to control the generator's behaviour we first allow it to run freely, recording a 'trace' of the decisions it makes at each choice point during the production of each text. Although there are only three choice points in figure 3, the control structure includes two loops: an outer loop which ranges over the sequence of propnets, generating a sentence for each one, and an inner loop which ranges over subnetworks of a propnet as realisation rules are chosen. So the decision structure for even a small text may be quite complex.</Paragraph> <Paragraph position="1"> In the experiments reported here, the trace of the generation process is simply a record of the number of times each decision (choice point, and what choice was made) occurred. Paiva (2004) discusses more complex tracing models, where the context of each decision (for example, what the preceding decision was) is recorded and used in the correlation. However, the best results were obtained using just the simple decision-counting model (perhaps in part due to data sparseness for more complex models).</Paragraph> </Section> <Section position="4" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 3.4 Correlating Decisions with Text Features </SectionTitle> <Paragraph position="0"> By allowing the generator to freely generate all possible output from a single input, we recorded a set of <trace, text> pairs ranging across the full variation space. From these pairs we derived corresponding <decision-count, factor-score> pairs, to which we applied a very simple correlational technique, multivariate linear regression analysis, which is used to find an estimator function for a linear relationship (i.e., one that can be approximated by a straight line) from the data available for several variables (Weisberg, 1985). In our case we want to predict the value of the score on a stylistic dimension (SSi) based on a configuration of generator decisions (GDj), as seen in equation 1.6</Paragraph> <Paragraph position="1"> (eq. 1) SSi = x0 + x1GD1 + ... + xnGDn + e</Paragraph> <Paragraph position="2"> We used three randomly sampled data sets of 1400, 1400 and 5000 observations, obtained from a potential base of about 1,400,000 different texts that could be produced by our generator from a single input. With each sample, we obtained a regression equation for each stylistic dimension separately. In the next subsections we present the final results for each dimension.</Paragraph>
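For concreteness, the fitting step of equation 1 can be sketched with ordinary least squares; the decision counts and factor scores below are invented, and NumPy is assumed.

```python
# Minimal sketch (not the authors' code) of the correlation step: fit equation 1
# by least squares, predicting a stylistic score SSi from per-text counts of
# generator decisions GD1..GDn. All data shown are invented.

import numpy as np

# decision_counts: one row per generated text, one column per traced decision type
decision_counts = np.array([
    [2, 0, 1, 3],
    [0, 2, 0, 1],
    [1, 1, 2, 0],
    [3, 0, 0, 2],
    [0, 3, 1, 1],
], dtype=float)
factor_scores = np.array([1.8, -2.1, -0.4, 2.5, -1.7])  # SSi per text (invented)

# Add an intercept column so the model is SSi = x0 + x1*GD1 + ... + xn*GDn
X = np.column_stack([np.ones(len(decision_counts)), decision_counts])
coeffs, _residuals, _rank, _sv = np.linalg.lstsq(X, factor_scores, rcond=None)

def predict(counts, coeffs=coeffs):
    """Estimate a stylistic score from a vector of decision counts."""
    return coeffs[0] + np.dot(coeffs[1:], counts)

print("fitted coefficients:", np.round(coeffs, 3))
print("predicted SSi for a new text:", round(predict([1, 0, 2, 1]), 3))
```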
<Paragraph position="3"> Regression on Stylistic Dimension 1. For the regression model on the first stylistic dimension (SS1), the generator decisions that were used in the regression analysis7 are: imperative sentences with one object (IMP_VNP), V_NP_PP agentless passive sentences (PAS_VNPP), V_NP by-passives (BYPAS_VN), and N_PP clauses (NPP); these are all decisions that happen in the realiser, i.e., at the third choice point in the architecture.</Paragraph> <Paragraph position="4"> This resulted in the regression equation8 shown in equation 2.</Paragraph> <Paragraph position="5"> 6 SSi represents a stylistic score and is the dependent variable or criterion in the regression analysis; the GDj's represent generator decisions and are called the independent variables or predictors; the xj's are weights, and e is the error.</Paragraph> <Paragraph position="6"> 7 The process of determining the regression takes care of eliminating the variables (i.e. generator decisions) that are not useful for estimating the stylistic dimensions.</Paragraph> <Paragraph position="7"> The coefficients for the regression on SS1 are unstandardised coefficients, i.e. the ones that are used when dealing with raw counts for the generator decisions.</Paragraph> <Paragraph position="8"> The coefficient of determination (R2), which measures the proportion of the variance of the dependent variable about its mean that is explained by the independent variables, had a reasonably high value (.895)9, and the analysis of variance yielded an F statistic of 1701.495.</Paragraph> <Paragraph position="9"> One of the assumptions of this technique is the linearity of the relation between the dependent and the independent variables (i.e., in our case, between the stylistic scores in a dimension and the generator decisions). The analysis of the residuals produced a plot that had some problems but that broadly resembled a normal distribution (see (Paiva, 2004) for more details).</Paragraph> <Paragraph position="10"> Regression on Stylistic Dimension 2. For the regression model on the second stylistic dimension (SS2) the variables that we used were: the number of times a network was split (SPLIT-NET), generation of a pronoun (RE_PRON), auxiliary verb (VAUX), noun with determiner (NOUN), transitive verb (VNP), and agentless passive (PAS_VNP) -- the first type of decision happens in the split network module (our first choice point); the second, in the referring expression module (second choice point); and the rest in the realiser (third choice point).</Paragraph> <Paragraph position="11"> The main results for this model are as follows: the coefficient of determination (R2) was .959 and the analysis of variance yielded an F statistic of 2298.519. The unstandardised regression coefficients for this model10 can be seen in eq. 3.</Paragraph>
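Applying such equations at generation time only requires the decision counts of the (partial) text under construction. The fragment below is illustrative only: the predictor names follow the two models above, but the coefficient values are invented placeholders rather than the fitted values of equations 2 and 3.

```python
# Illustration only: applying regression equations of the form of eq. 1 to a
# trace of decision counts. Predictor names follow the text; coefficient values
# are invented placeholders, not the published ones.

SS1_COEFFS = {"intercept": -1.0, "IMP_VNP": 0.9, "PAS_VNPP": -0.7,
              "BYPAS_VN": -0.5, "NPP": 0.3}
SS2_COEFFS = {"intercept": 0.5, "SPLITNET": -0.8, "RE_PRON": 1.1, "VAUX": 0.4,
              "NOUN": -0.6, "VNP": 0.2, "PAS_VNP": -0.3}

def estimate_score(decision_counts, coeffs):
    """Estimated stylistic score for a (partial) text, given its decision counts."""
    score = coeffs["intercept"]
    for decision, count in decision_counts.items():
        score += coeffs.get(decision, 0.0) * count
    return score

trace = {"IMP_VNP": 2, "NPP": 1, "RE_PRON": 3, "SPLITNET": 1}
print(estimate_score(trace, SS1_COEFFS), estimate_score(trace, SS2_COEFFS))
```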
<Paragraph position="12"> 8 This specific equation (equation 2) came from the sample with 5,000 observations -- the equations obtained from the other samples are very similar to this one.</Paragraph> <Paragraph position="13"> 9 All the statistical results presented in this paper are significant at the 0.01 level (two-tailed).</Paragraph> <Paragraph position="14"> 10 This specific equation (equation 3) comes from one of the samples of 1,400 observations.</Paragraph> <Paragraph position="15"> With this second model we did not find any problems with the linearity assumption, as the analysis of the residuals gave an approximately normal plot.</Paragraph> </Section> </Section> <Section position="6" start_page="62" end_page="321" type="metho"> <SectionTitle> 4 Controlling the Generator </SectionTitle> <Paragraph position="0"> These regression equations characterise the way in which generator decisions influence the final style of the text (as measured by the stylistic factors). In order to control the generator, the user specifies a target stylistic score for each dimension of the text to be generated. At each choice point during generation, all possible decisions are collected in a list and the regression equations are used to order them. The equations allow us to estimate the subsequent values of SS1 and SS2 for each of the possible decisions, and the decisions are ordered according to the distance of the resulting scores from the target scores -- the closer the score, the better the decision.</Paragraph> <Paragraph position="1"> Hence the search algorithm that we are using here is best-first search, i.e., the best local solution according to an evaluation function (in this case the Euclidean distance between the target scores and the values estimated by the regression equations) is tried first, but all the other local solutions are kept in order, so backtracking is possible.</Paragraph>
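One simple way to realise this control regime is sketched below. It is illustrative only: the expand and is_complete callbacks and the two score predictors are hypothetical stand-ins for the generator's actual choice points and fitted equations, and a global priority queue is used as one straightforward way of keeping the untried alternatives ordered for backtracking.

```python
# Minimal sketch (not the authors' implementation) of regression-guided
# best-first generation: candidate states are ranked by the Euclidean distance
# between their estimated (SS1, SS2) scores and the target point.

import heapq
import math

def best_first_generate(start, target, expand, is_complete, predict_ss1, predict_ss2):
    """Return the first complete generation state reached by best-first search."""
    def dist(state):
        return math.hypot(predict_ss1(state) - target[0],
                          predict_ss2(state) - target[1])

    tie = 0                      # tie-breaker so states themselves are never compared
    frontier = [(dist(start), tie, start)]
    while frontier:
        _, _, state = heapq.heappop(frontier)      # most promising state first
        if is_complete(state):
            return state
        for successor in expand(state):            # one successor per possible decision
            tie += 1
            heapq.heappush(frontier, (dist(successor), tie, successor))
    return None

# Toy usage: a 'state' is a tuple of decision labels, generation stops after three
# decisions, and the predictors are invented linear functions of the counts.
if __name__ == "__main__":
    choices = ["IMP_VNP", "PAS_VNPP", "RE_PRON"]
    result = best_first_generate(
        start=(),
        target=(2.0, -1.0),
        expand=lambda s: [s + (c,) for c in choices] if len(s) < 3 else [],
        is_complete=lambda s: len(s) == 3,
        predict_ss1=lambda s: 0.9 * s.count("IMP_VNP") - 0.7 * s.count("PAS_VNPP"),
        predict_ss2=lambda s: 1.1 * s.count("RE_PRON") - 0.8 * s.count("PAS_VNPP"),
    )
    print(result)
```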
<Paragraph position="2"> In this paper we report on tests of two internal aspects of the system.11 First, we wish to know how good the generator is at hitting a user-specified target -- i.e., how close the scores given by the regression equations for the first text generated are to the user's input target scores. Second, we wish to know how good the regression equation scores are at modelling the original stylistic factors -- i.e., we want to compare the regression scores of an output text with the factor analysis scores. We address these questions across the whole of the two-dimensional stylistic space, by specifying a rectangular grid of scores spanning the whole space and asking the generator to produce texts for each grid point from the same semantic input specification.</Paragraph> <Paragraph position="3"> 11 We are not dealing with external (user) evaluation of the system and of the stylistic dimensions we obtained -- this was left for future work. Nonetheless, Sigley (1997) showed that the dimensions obtained with factor analysis and people's perceptions have a high correlation.</Paragraph> <Paragraph position="4"> In this case we divided the scoring space with an 8 by 10 grid pattern, as shown in figure 6.12 Each point specifies the target scores for a text that should be generated (the number next to each point identifies the text). For instance, text number 1 was targeted at coordinate (-7, -44), whereas text number 79 was targeted at coordinate (+7, -28).</Paragraph> <Section position="1" start_page="62" end_page="63" type="sub_section"> <SectionTitle> 4.1 Comparing Target Points and Regression Scores </SectionTitle> <Paragraph position="0"> In the first part of this experiment we wanted to know how close to the user-specified target coordinates the resulting regression scores of the first generated text were. This can be done in two different ways. The first is to plot the resulting regression scores (see figure 7) and visually check whether they mirror the grid-shape pattern of the target points (figure 6) -- this can be done by inspecting the text identifiers.13 This can be a bit misleading because there will always be variation around the target point that was supposed to be achieved (i.e., there is a margin for error), and this can blur the comparison unfavourably.</Paragraph> <Paragraph position="1"> 12 The range for each scale comes from the maximum and minimum values for the factors obtained in the samples of generated texts.</Paragraph> <Paragraph position="2"> 13 Note that some texts obtained the same regression score and, in the statistical package, only one was numbered. Those instances are: 1 and 7; 18 and 24; 22 and 28.</Paragraph> <Paragraph position="3"> A more formal comparison can be made by plotting the target points versus the regression results for each dimension separately and obtaining a correlation measure between these values. These correlations are shown in figure 8 for SS1 (left) and SS2 (right). The degree of correlation (R2) between the values of target and regression points is 0.9574 for SS1 and 0.942 for SS2, which means that the search mechanism is working very satisfactorily on both dimensions.</Paragraph> </Section> <Section position="2" start_page="63" end_page="321" type="sub_section"> <SectionTitle> 4.2 Comparing Target Points and Stylistic Scores </SectionTitle> <Paragraph position="0"> In the second part of this experiment we wanted to know whether the regression equations were doing the job they were supposed to do, by comparing the regression scores with the stylistic scores obtained (from the factor analysis) for each of the generated texts. In figure 9 we plotted the texts in a graph in accordance with their stylistic scores (once again, some texts occupy the same point so they do not appear).</Paragraph> <Paragraph position="1"> 14 All the correlational figures (R2) presented for this experiment are significant at the 0.01 level (two-tailed).</Paragraph> <Paragraph position="2"> In the ideal situation, the generator would have produced texts with perfect regression scores, and these would be identical to the stylistic scores, so the graph in figure 9 would be grid-shaped like the one in figure 6. However, we have already seen in figure 7 that this is not the case for the relation between the target coordinates and the regression scores. So we did not expect the plot of stylistic scores 1 (SS1) against stylistic scores 2 (SS2) to be a perfect grid.</Paragraph>
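The comparison itself amounts to correlating target scores with achieved scores on each dimension. A minimal sketch of that computation, with invented numbers rather than the paper's data, is shown below.

```python
# Small sketch of the evaluation used in sections 4.1 and 4.2: correlate the
# target scores on one dimension with the scores actually achieved by the
# generated texts, and report R^2. The values below are invented.

import numpy as np

def r_squared(targets, achieved):
    """Squared Pearson correlation between target and achieved scores."""
    r = np.corrcoef(targets, achieved)[0, 1]
    return r * r

# e.g. SS1 targets for a row of grid points and the scores the texts obtained
targets_ss1 = np.array([-7.0, -5.0, -3.0, -1.0, 1.0, 3.0, 5.0, 7.0])
achieved_ss1 = np.array([-6.1, -5.3, -2.4, -1.2, 0.7, 3.5, 4.6, 6.8])
print(round(r_squared(targets_ss1, achieved_ss1), 4))
```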
<Paragraph position="3"> Figure 10 (left-hand side) shows the relation between the target points and the scores obtained from the original factor equation of SS1. The value of R2, which represents their correlation, is high (0.9458), considering that this represents the possible accumulation of errors from two stages: from the target to the regression scores, and then from the regression to the actual factor scores. On the right of figure 10 we can see the plot of the target points against their respective factor scores on SS2. The correlation obtained is also reasonably high.</Paragraph> </Section> </Section> </Paper>