<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1016">
  <Title>Statistical Acquisition of Content Selection Rules for Natural Language Generation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Domain: Biographical Descriptions
</SectionTitle>
    <Paragraph position="0"> The research described here supports the automatic construction of the Content Selection module of PROGENIE (Duboue and McKeown, 2003a), a biography generator under construction. Biography generation is an exciting field that has attracted NLG practitioners in the past (Kim et al., 2002; Schiffman et al., 2001; Radev and McKeown, 1997; Teich and Bateman, 1994). It has the advantage of being a constrained domain amenable to current generation approaches, while still offering more possibilities than many constrained domains, given the variety of styles that biographies exhibit and the possibility of ultimately generating relatively long biographies.</Paragraph>
    <Paragraph position="1"> We have gathered a resource of text and associated knowledge in the biography domain. More specifically, our resource is a collection of human-produced texts together with the knowledge base a generation system might use as input. The knowledge base contains many pieces of information related to the person the biography talks about, not all of which will necessarily appear in the biography. That is, the associated knowledge base is not the semantics of the target text but the larger set1 of all things that could possibly be said about the person in question. The intersection between the input knowledge base and the semantics of the target text is what we are interested in capturing by means of our statistical techniques.</Paragraph>
    <Paragraph position="2"> To collect the semantic input, we crawled 1,100 HTML pages containing celebrity fact-sheets from the E! Online website.2 The pages comprise information in 14 categories for actors, directors, producers, screenwriters, etc. We then transformed the information in the pages into a frame-based knowledge representation. The final corpus contains 50K frames, with 106K frame-attribute-value triples, covering the 1,100 people described by the fact-sheets. An example set of frames is shown in Figure 3.</Paragraph>
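The frame-based representation can be pictured as follows. This is a minimal sketch: the frame identifiers, attribute names, and values are invented for illustration (the paper's actual frames come from the E! Online fact-sheets).

```python
# Hypothetical frame store: each frame is a set of attribute-value pairs,
# where a value is either an atom or a reference to another frame.
frames = {
    "person-1": {"name": "frame-name-1", "birth": "frame-birth-1",
                 "occupation": "c-actor"},
    "frame-name-1": {"first": "Dashiel", "last": "Doe"},
    "frame-birth-1": {"year": 1894},
}

def to_triples(frames):
    """Flatten the frame store into (frame, attribute, value) triples,
    the unit the corpus statistics (106K triples) are counted in."""
    return [(f, a, v) for f, attrs in frames.items() for a, v in attrs.items()]

triples = to_triples(frames)
```

The triple view is what makes counts such as "50K frames, 106K triples" directly computable from the store.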
    <Paragraph position="3"> The textual part was mined from two different websites: biography.com, containing typical biographies with an average length of 450 words; and imdb.com, the Internet Movie Database, whose biographies average 250 words. In each case, we thus obtained the semantic input from one website and a separate biography from a second website. We linked the two resources using record-linkage techniques from census statistical analysis (Fellegi and Sunter, 1969), basing the linkage on the Last Name, First Name, and Year of Birth attributes.</Paragraph>
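The linkage step can be sketched as a join on the three key attributes. The Fellegi-Sunter method is probabilistic; this is a simplified deterministic stand-in, and the record fields are hypothetical.

```python
def link_records(semantic_records, biographies):
    """Pair fact-sheet records with biographies that agree on
    (Last Name, First Name, Year of Birth) -- a deterministic
    simplification of Fellegi-Sunter record linkage."""
    index = {(r["last"], r["first"], r["birth_year"]): r
             for r in semantic_records}
    pairs = []
    for bio in biographies:
        key = (bio["last"], bio["first"], bio["birth_year"])
        if key in index:
            pairs.append((index[key], bio))
    return pairs

facts = [{"last": "Smith", "first": "Ann", "birth_year": 1960, "id": "f1"}]
bios = [{"last": "Smith", "first": "Ann", "birth_year": 1960, "text": "..."},
        {"last": "Jones", "first": "Bob", "birth_year": 1970, "text": "..."}]
linked = link_records(facts, bios)
```

A real linkage would weight agreement on each field rather than require exact equality, which is what makes the probabilistic framework necessary at census scale.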
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Methods
</SectionTitle>
    <Paragraph position="0"> Figure 2 illustrates our two-step approach. In the first step (shaded region of the figure), we try to identify and solve the easy cases for Content Selection. The easy cases in our task are pieces of data that are copied verbatim from the input to the output. In biography generation, this includes names, dates of birth and the like. The details of this process are discussed in Section 3.1. After these cases have been addressed, the remaining semantic data is clustered and the text corresponding to each cluster post-processed to measure degrees of influence for different semantic units, presented in Section 3.2.</Paragraph>
    <Paragraph position="1"> Further techniques to improve the precision of the algorithm are discussed in Section 3.3.</Paragraph>
    <Paragraph position="2"> Central to our approach is the notion of data paths in the semantic network (an example is shown in Figure 3). Given a frame-based representation of knowledge, we need to identify particular pieces of knowledge inside the graph. We do so by selecting a particular frame as the root of the graph (in our case, the person whose biography we are generating, doubly circled in the figure) and considering the paths in the graph as identifiers for the different pieces of data. We call these data paths. Each path identifies a class of values, since some attributes are list-valued (e.g., the relative attribute in the figure). We use the notation ⟨attribute-1 attribute-2 ... attribute-n⟩ (e.g., ⟨relative person name first⟩) to denote data paths.</Paragraph>
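Enumerating data paths from the root frame might look like the following sketch. The frame layout and attribute names are illustrative; the point is that list-valued attributes make a single path identify several values.

```python
def data_paths(frames, frame_id, prefix=()):
    """Enumerate (path, value) pairs reachable from the root frame.
    List-valued attributes contribute one value per element, so one
    path such as ('relative', 'first') can identify several values."""
    out = []
    for attr, value in frames[frame_id].items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, str) and v in frames:   # reference to a sub-frame
                out += data_paths(frames, v, prefix + (attr,))
            else:                                    # atomic value: emit it
                out.append((prefix + (attr,), v))
    return out

frames = {
    "person": {"name": "nm", "relative": ["rel1", "rel2"]},
    "nm": {"first": "John"},
    "rel1": {"TYPE": "c-son", "first": "Dashiel"},
    "rel2": {"TYPE": "c-son", "first": "Jason"},
}
paths = data_paths(frames, "person")
```

Here the single path ('relative', 'first') identifies both "Dashiel" and "Jason", mirroring the list-valued relative attribute of Figure 3.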
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Exact Matching
</SectionTitle>
      <Paragraph position="0"> In the first stage (cf. Fig. 2(1)), the objective is to identify pieces from the input that are copied verbatim to the output. These verbatim-copied anchors are easy to identify, and they allow us to do two things before further analyzing the input data: remove the data from the input, as it has already been selected for inclusion in the text; and mark the matched span as part of the input rather than as actual text.</Paragraph>
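A minimal sketch of the verbatim-matching stage, assuming the input arrives as (data path, value) pairs; the paths and the sample text are invented.

```python
import re

def exact_matches(path_values, text):
    """Return the (path, value) pairs whose value appears verbatim in
    the text -- the 'easy cases' of content selection (names, dates)."""
    selected = []
    for path, value in path_values:
        # word-boundary match so e.g. '7' does not fire inside '1950'
        if re.search(r"\b%s\b" % re.escape(str(value)), text):
            selected.append((path, value))
    return selected

pv = [(("name", "first"), "John"),
      (("birth", "date", "year"), 1950),
      (("relative", "person", "age"), 7)]
text = "John was born in 1950 in Boston."
anchors = exact_matches(pv, text)
```

The matched pairs would then be removed from the input, and the matched spans marked in the text, before the statistical stage runs.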
      <Paragraph position="1"> The rest of the semantic input is either verbalized (e.g., by means of a verbalization rule of the form ⟨brother age⟩ ⇒ "young") or not included at all. This situation is much more challenging and requires the use of our proposed statistical selection technique.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Statistical Selection
</SectionTitle>
      <Paragraph position="0"> For each class in the semantic input that was not ruled out in the previous step, we cluster (cf. Fig. 2(2)) the possible values in the path, over all people (e.g., {0 ≤ age ≤ 25}, {25 ≤ age ≤ 50}, {50 ≤ age ≤ 75} for age). Clustering details can be found in (Duboue and McKeown, 2003b).</Paragraph>
      <Paragraph position="3"> In the case of free-text fields, the top-level, most informative terms3 are picked and used for the clustering. For example, for "played an insecure young resident" the terms would be {played, insecure, resident}.</Paragraph>
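The footnote's term-weighting scheme, scoring each term by its maximum TF*IDF weight over the whole collection, can be sketched as follows. Whitespace tokenization and the toy documents are simplifications.

```python
import math
from collections import Counter

def term_weights(docs):
    """Weight each term by the maximum of its TF*IDF values over the
    collection; a word present in every document gets IDF = 0 and thus
    weight 0, which is how stop words drop out automatically."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    weights = {}
    for doc in tokenized:
        tf = Counter(doc)
        for t, c in tf.items():
            weights[t] = max(weights.get(t, 0.0), c * idf[t])
    return weights

def informative_terms(field, weights, k=3):
    """Keep the k highest-weighted terms of one free-text field."""
    terms = set(field.lower().split())
    return sorted(terms, key=lambda t: weights.get(t, 0.0), reverse=True)[:k]

docs = ["played an insecure young resident",
        "directed an acclaimed drama",
        "an award for an acclaimed comedy"]
w = term_weights(docs)
picked = informative_terms("played an insecure young resident", w)
```

Note how "an", which occurs in every toy document, is excluded without any explicit stop-word list.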
      <Paragraph position="4"> 3We use the maximum value of the TF*IDF weights for each term over the whole text collection. This has the immediate effect of disregarding stop words.</Paragraph>
      <Paragraph position="5"> Having done so, the texts associated with each cluster are used to derive language models (in our case bi-grams: we count the bi-grams appearing in all the biographies for a given cluster, e.g., all the people aged between 25 and 50, {25 ≤ age ≤ 50}). We then measure the variations in the language models produced by the variation (clustering) in the data. What we want to find is a change in word choice correlated with a change in data. If there is no correlation, then the piece of data that changed should not be selected by Content Selection.</Paragraph>
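Building the per-cluster bigram models might look like this sketch. The add-one (Laplace) smoothing is an assumption of mine, chosen only so that unseen bigrams keep the cross-entropy finite; the paper does not specify its smoothing.

```python
from collections import Counter

def bigram_model(texts):
    """Build a bigram count model over the biographies of one semantic
    cluster, plus a Laplace-smoothed probability lookup."""
    counts = Counter()
    vocab = set()
    for text in texts:
        tokens = text.lower().split()
        vocab.update(tokens)
        counts.update(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    v = max(len(vocab) ** 2, 1)   # size of the possible-bigram space

    def prob(bigram):
        # add-one smoothing: unseen bigrams get small non-zero mass
        return (counts[bigram] + 1) / (total + v)

    return counts, prob

counts, prob = bigram_model(["the young actor", "the young director"])
```

One such model is derived per cluster, and one from random document samples, so that their divergence can be measured.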
      <Paragraph position="6"> In order to compare language models, we turned to techniques from adaptive NLP (i.e., adaptation on the basis of genre and type distinctions) (Illouz, 2000). In particular, we employed the cross entropy4 between two language models M1 and M2, defined as follows (where Pr_M(x) is the probability that M assigns to the n-gram x): CE(M1, M2) = - Σ_x Pr_M1(x) log Pr_M2(x).</Paragraph>
      <Paragraph position="8"> Smaller values of CE(M1, M2) indicate that M1 is more similar to M2. On the other hand, if we take M1 to be a model of randomly selected documents and M2 a model of the subset of texts associated with a cluster, then a greater-than-chance CE value is an indicator that the cluster on the semantic side correlates with changes on the text side.</Paragraph>
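The cross-entropy comparison can be made concrete with toy unigram models standing in for the bigram models (the probability numbers are invented). By Gibbs' inequality the cross term is never smaller than the self term.

```python
import math

def cross_entropy(p1, p2, support):
    """CE(M1, M2) = - sum_x Pr_M1(x) * log Pr_M2(x) over the n-grams in
    `support`; smaller values mean M1 is closer to M2."""
    return -sum(p1(x) * math.log(p2(x)) for x in support if p1(x) > 0)

# Toy distributions over a three-word vocabulary (illustrative numbers).
m1 = {"a": 0.5, "b": 0.3, "c": 0.2}
m2 = {"a": 0.4, "b": 0.4, "c": 0.2}
ce_self = cross_entropy(m1.get, m1.get, m1)    # = entropy of m1
ce_cross = cross_entropy(m1.get, m2.get, m1)   # >= ce_self (Gibbs)
```

The gap between the cross term and the self term is what the sampling procedure below turns into a statistical test.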
      <Paragraph position="9"> We then need to perform a sampling process to obtain CE values that represent the null hypothesis in the domain. We sample two arbitrary subsets of k elements each from the total set of documents and compute the CE of their derived language models (these CE values constitute our control set). We then compare a random sample of size k from the cluster against a random sample of size k from the difference between the whole collection and the cluster (these CE values constitute our experiment set). To see whether the values in the experiment set are stochastically larger than the values in the control set, we employed the Mann-Whitney U test (Siegel and Castellan Jr., 1988) (cf. Fig. 2(3)). We performed 20 rounds of sampling and tested at the 0.05 significance level.4 (4Other metrics would have been possible, inasmuch as they measure the similarity between the two models.)</Paragraph>
      <Paragraph position="11"> Finally, if the cross-entropy values for the experiment set are larger than for the control set, we can infer that the values for that semantic cluster do influence the text. Thus, a positive U test for any data path was taken as an indicator that the data path should be selected.</Paragraph>
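The sampling procedure can be sketched as below. The numeric "documents" and the toy divergence function are stand-ins for real texts and the cross-entropy of their language models, and `u_statistic` is the plain pairwise definition of Mann-Whitney U; the paper applies the full test with a significance threshold.

```python
import random

def u_statistic(xs, ys):
    """Mann-Whitney U: number of (x, y) pairs with x > y, ties counted 1/2."""
    u = 0.0
    for x in xs:
        for y in ys:
            u += 1.0 if x > y else 0.5 if x == y else 0.0
    return u

def sample_ce_sets(cluster, rest, all_docs, k, ce, rounds=20, rng=random):
    """Control CE values: two random k-subsets of the whole collection.
    Experiment CE values: a cluster sample vs. a sample of its complement."""
    control, experiment = [], []
    for _ in range(rounds):
        control.append(ce(rng.sample(all_docs, k), rng.sample(all_docs, k)))
        experiment.append(ce(rng.sample(cluster, k), rng.sample(rest, k)))
    return control, experiment

rng = random.Random(0)
cluster = [10, 11, 12, 13, 14, 15]       # docs of people "in the cluster"
rest = [0, 1, 2, 3, 4, 5]                # docs of everyone else
toy_ce = lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b))
control, experiment = sample_ce_sets(cluster, rest, cluster + rest, 3,
                                     toy_ce, rng=rng)
u = u_statistic(experiment, control)
```

With the toy data the cluster genuinely differs from its complement, so the experiment divergences stochastically dominate the control ones and U is close to its maximum of rounds squared.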
      <Paragraph position="12"> Using simple thresholds and the U test, class-based content selection rules can be obtained. These rules will select or unselect every instance of a given data path at the same time (e.g., if ⟨relative person name first⟩ is selected, then both "Dashiel" and "Jason" will be selected in Figure 3). By counting the number of times a data path from the exact matching appears in the texts (above some fixed threshold), we can obtain baseline content selection rules (cf. Fig. 2(A)). Adding our statistically selected (by means of the cross-entropy sampling and the U test) data paths to that set, we obtain class-based content selection rules (cf. Fig. 2(B)).</Paragraph>
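The baseline rules reduce to a frequency threshold over the exact matches; a sketch, with an illustrative threshold value:

```python
from collections import Counter

def baseline_rules(matched_paths_per_person, threshold=0.5):
    """Class-based baseline: select a data path if it was exact-matched
    in more than `threshold` of the texts (threshold is illustrative)."""
    n = len(matched_paths_per_person)
    counts = Counter(p for paths in matched_paths_per_person
                     for p in set(paths))
    return {p for p, c in counts.items() if c / n > threshold}

# Per-person sets of exact-matched data paths (hypothetical).
matches = [
    {("name", "first"), ("birth", "date", "year")},
    {("name", "first")},
    {("name", "first"), ("education", "place")},
]
rules = baseline_rules(matches)
```

Because such a rule fires for every instance of the path at once, it cannot distinguish, say, children's names from grandchildren's names, which motivates the instance-level learning of Section 3.3.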
      <Paragraph position="13"> Given the simplicity of the algorithm, we expect these rules to overgenerate, but to achieve excellent coverage. These class-based rules are related to the KR concept of Viewpoints (Acker and Porter, 1994):5 we extract a slice of the knowledge base that is relevant to the domain task at hand. (5They define a Viewpoint as a coherent sub-graph of the knowledge base describing the structure and function of objects, the change made to objects by processes, and the temporal attributes and temporal decompositions of processes.)</Paragraph>
      <Paragraph position="15"> However, the expressivity of the class-based approach is plainly not enough to capture the idiosyncrasies of content selection: for example, it may be the case that children's names are worth mentioning while grandchildren's names are not. That is, in Figure 3, ⟨relative person name first⟩ is dependent on ⟨relative TYPE⟩; therefore, all the information in the current instance should be taken into account to decide whether a particular data path and its values should be included or not.</Paragraph>
      <Paragraph position="16"> Our approach so far simply determines that an attribute should always be included in a biography text. This example illustrates that content selection rules should also capture cases where an attribute is to be included only under certain conditions, that is, only when other semantic attributes take on specific values.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Improving Precision
</SectionTitle>
      <Paragraph position="0"> We turned to ripper6 (Cohen, 1996), a supervised rule-learning categorization tool, to elucidate these types of relationships. We use as features a flattened version of the input frames,7 plus the actual value of the data in question. To obtain the right label for each training instance we do the following: for the exact-matched data paths, matched pieces of data correspond to positive training instances and unmatched pieces to negative ones. That is to say, if we know that ⟨brother age⟩ has some value v and that v appears in the text, we can conclude that the data of this particular person can be used as a positive training instance for ⟨brother age⟩ with value v. Similarly, if there is no match, the opposite is inferred.</Paragraph>
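Label generation for the exact-matched paths might be sketched as follows. The record layout and the flattened features are assumptions, and the values are invented; ripper itself is not reimplemented here, only the construction of its training set.

```python
def label_instances(people, path):
    """Build ripper-style training instances for one data path: the
    flattened frame features plus the path's value, labelled 'selected'
    when that value was exact-matched in the person's biography."""
    instances = []
    for person in people:
        value = person["values"].get(path)
        if value is None:            # path absent for this person
            continue
        label = ("selected" if value in person["matched_values"]
                 else "unselected")
        features = dict(person["flat_features"], value=value)
        instances.append((features, label))
    return instances

# Hypothetical per-person records: path values, exact-matched values,
# and a (tiny) flattened feature vector.
people = [
    {"values": {("brother", "age"): 25},
     "matched_values": {25},
     "flat_features": {"occupation": "c-actor"}},
    {"values": {("brother", "age"): 63},
     "matched_values": set(),
     "flat_features": {"occupation": "c-writer"}},
]
data = label_instances(people, ("brother", "age"))
```

Each instance carries the whole flattened context, which is what lets the learner condition selection of one path on the values of others.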
      <Paragraph position="1"> For the U-test-selected paths, the situation is more complex, as we only have clues about the importance of the data path as a whole. That is, while we know that a particular data path is relevant to our task (biography construction), we do not know with which values that data path is being verbalized. We need to obtain more information from the sampling process to be able to identify cases in which we believe the relevant data path has been verbalized.</Paragraph>
      <Paragraph position="2"> 7The flattening maps data paths to individual features, e.g., if a person had a grandmother, then there will be a "grandmother" column for every person. This gets more complicated when list-valued attributes are taken into play. In our biographies case, an average-sized input of 100 triples spanned over 2,300 entries in the feature vector.</Paragraph>
      <Paragraph position="3"> To obtain finer-grained information, we turned to an n-gram distillation process (cf. Fig. 2(4)), in which the most significant n-grams (bi-grams in our case) were picked during the sampling process by looking at their overall contribution to the CE term in Equation 1. For example, our system found the bi-grams screenwriter director and has screenwriter8 to be relevant for the cluster (⟨occupation TYPE⟩, c-writer), while the cluster (⟨occupation TYPE⟩, {c-comedian, c-actor}) does not include those, but does include sitcom Time and Comedy Musical.</Paragraph>
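The distillation criterion, ranking bigrams by their contribution to the CE sum, can be sketched like this. The probabilities are invented, and the 1e-6 floor for bigrams unseen in the background model is an assumption standing in for smoothing.

```python
import math

def distill(p_cluster, p_background, k=2):
    """Rank bigrams by their contribution -Pr_c(x) * log Pr_b(x) to the
    cross-entropy term: bigrams frequent in the cluster but rare in the
    background contribute most, and are kept as distilled indicators."""
    contrib = {x: -p * math.log(p_background.get(x, 1e-6))
               for x, p in p_cluster.items()}
    return sorted(contrib, key=contrib.get, reverse=True)[:k]

p_cluster = {("screenwriter", "director"): 0.3,
             ("has", "screenwriter"): 0.25,
             ("was", "born"): 0.45}
p_background = {("was", "born"): 0.5,
                ("screenwriter", "director"): 0.01,
                ("has", "screenwriter"): 0.02}
top = distill(p_cluster, p_background)
```

The generic "was born" bigram, common everywhere, is ranked out despite being the most frequent in the cluster.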
      <Paragraph position="4"> These n-grams thus indicate that the data path ⟨occupation TYPE⟩ is verbalized in the text and that a change in its value does affect the output. We later use the matching of these n-grams as an indicator that the particular instance was selected in that document.</Paragraph>
      <Paragraph position="7"> Finally, the training data for each data path is generated (cf. Fig. 2(5)). The selected or unselected label is thus chosen either via direct extraction from the exact match or by means of identification of distilled, relevant n-grams. After ripper is run, the obtained rules are our sought content selection rules (cf. Fig. 2(5)).</Paragraph>
    </Section>
  </Section>
</Paper>