<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1016"> <Title>Statistical Acquisition of Content Selection Rules for Natural Language Generation</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> We used the following experimental setting: 102 frames were separated from the full set, together with their associated 102 biographies from the biography.com site. This constituted our development corpus. We further split that corpus into development training (91 people) and development test (11 people), and hand-tagged those 11 document-data pairs.</Paragraph>
<Paragraph position="1"> The annotation was done by one of the authors, by reading the biographies and checking which triples (in the RDF sense, ⟨frame, slot, value⟩) were actually mentioned in the text (going back and forth to the biography as needed). Two cases required special attention. The first was aggregated information: e.g., the text may say &quot;he received three Grammys&quot; while in the semantic input each award was itemized, together with the year it was received, the reason and the type (Best Song of the Year, etc.). In that case, only the name of the award was selected, for each of the three awards. [Footnote: our bi-grams are computed after stop-words and punctuation are removed; therefore these examples can appear in texts like &quot;he is an screenwriter, director, . . .&quot; or &quot;she has an screenwriter award . . .&quot;]</Paragraph>
<Paragraph position="2"> The second case was factual errors. For example, the biography may say the person was born in MA and raised in WA, while the fact-sheet may say he was born in WA. In those cases, the intention of the human writer was given priority and the place of birth was marked as selected, even though one of the two sources was wrong. The annotated data total 1,129 triples. Of these, only 293 triples (26%) were verbalized in the associated text and thus considered selected.</Paragraph>
<Paragraph position="3"> That implies that the &quot;select all&quot; tactic (&quot;select all&quot; is the only trivial content selection tactic; &quot;select none&quot; is of no practical value) will achieve an F-measure of 0.41 (26% precision at 100% recall).</Paragraph>
<Paragraph position="4"> Following the methods outlined in Section 3, we used the training part of the development corpus to mine Content Selection rules. We then used the development test set to run different trials and fit the parameters of the algorithm. Namely, we determined that filtering the bottom 1,000 TF*IDF-weighted words from the text before building the n-gram model was important for the task (we compared against other filtering schemes and against the use of stop-word lists). The best parameters found and the fitting methodology are described in (Duboue and McKeown, 2003b).</Paragraph>
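To make the two quantitative points above concrete, the short Python sketch below is not the authors' implementation: it assumes a plain TF*IDF weighting over word types, and the names f_measure, filter_bottom_tfidf and BOTTOM_K are illustrative only; the exact weighting scheme and the parameter fitting are those described in (Duboue and McKeown, 2003b).

# Minimal sketch, not the authors' code: the plain TF*IDF weighting and
# the names below are assumptions made for illustration only.
import math
from collections import Counter

BOTTOM_K = 1000  # the bottom 1,000 TF*IDF-weighted words are filtered out

def f_measure(precision, recall):
    # Balanced F-measure; the "select all" baseline gives
    # f_measure(0.26, 1.0) ~= 0.41 (26% precision at 100% recall).
    return 2 * precision * recall / (precision + recall)

def filter_bottom_tfidf(docs, bottom_k=BOTTOM_K):
    # docs: tokenized biographies (lists of lower-cased words).
    # Drops the bottom_k lowest TF*IDF-weighted word types before the
    # n-gram model over the texts is built.
    df = Counter()                                # document frequency
    for doc in docs:
        df.update(set(doc))
    tf = Counter(w for doc in docs for w in doc)  # corpus term frequency
    n_docs = len(docs)
    weight = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    dropped = {w for w, _ in sorted(weight.items(), key=lambda kv: kv[1])[:bottom_k]}
    return [[w for w in doc if w not in dropped] for doc in docs]

Under this plain weighting, words that occur in every biography receive weight zero and are dropped first, which is in line with the comparison against stop-word lists mentioned above.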
<Paragraph position="5"> We then evaluated on the rest of the semantic input (998 people), aligned with a second textual corpus (imdb.com). As the average length per biography differs between the corpora we worked with (450 and 250, respectively), the content selection rules to be learned in each case were also different (thus ensuring an interesting evaluation of the learning capabilities). In each case, we split the data into training and test sets and hand-tagged the test set, following the same guidelines explained for the development corpus. The linkage step also required some work: we were able to link 205 people in imdb.com and separated 14 of them as the test set.</Paragraph>
<Paragraph position="6"> The results are shown in Table 1. [Footnote: we perturbed the dataset to obtain some cross-validation over these figures, obtaining a standard deviation of 0.02 for the F*; the full details are given in (Duboue and McKeown, 2003b).] Several things can be noted in the table. The first is that imdb.com represents a harder set than our development set. That is to be expected, as biography.com's biographies follow a stable editorial line, while imdb.com biographies are submitted by Internet users. However, our methods offer comparable results on both sets. Nonetheless, the table portrays a clear result: the class-based rules are the ones that produce the best overall results; they have the highest F-measure of all approaches and high recall. In general, we want an approach that favors recall over precision, in order to avoid losing any information that needs to be included in the output. Overgeneration (low precision) can be corrected by later processes that further filter the data; such processing over the output will need other types of information to finish the Content Selection process. The class-based rules filter out about 50% of the available data, while maintaining the relevant data in the output set.</Paragraph>
<Paragraph position="7"> An example rule from the ripper approach can be seen in Figure 4; it says that the subtitle of the award (e.g., &quot;Best Director&quot;, for an award with title &quot;Oscar&quot;) should be selected if the person is a director who studied in the US and the award is not of Festival type. The rules themselves look interesting, but while they improve precision, as was our goal, their lack of recall makes their current implementation unsuitable for use. We have identified a number of changes that could improve this process and discuss them at the end of the next section. Given the experimental nature of these results, we would not yet draw any conclusions about the ultimate benefit of the ripper approach.</Paragraph> </Section> </Paper>