<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1422">
<Title>GENEVAL: A Proposal for Shared-task Evaluation in NLG</Title>
<Section position="5" start_page="136" end_page="137" type="metho">
<SectionTitle> 3 Our Proposal </SectionTitle>
<Paragraph position="0"> We intend to apply for funding for a three-year project to create more shared input/output data sets (we are focusing on data-to-text tasks for the reasons discussed in Belz and Kilgarriff (2006)), organise shared-task workshops, and create and test a range of methods for evaluating submitted systems.</Paragraph>
<Section position="1" start_page="136" end_page="136" type="sub_section">
<SectionTitle> 3.1 Step 1: Create data sets </SectionTitle>
<Paragraph position="0"> We intend to create input/output data sets that contain not only input data and corpus texts but also intermediate representations (content representations and semantic-level representations). The presence of intermediate representations in our data sets means that researchers who are interested only in document planning, microplanning, or surface realisation do not need to build complete NLG systems in order to participate.</Paragraph>
<Paragraph position="1"> We will create the semantic-level representations by parsing the corpus texts, probably using a LinGO parser. We will create the content representations using application-specific analysis tools, similar to a tool we have already created for SumTime wind statements. The actual data sets we currently intend to create are as follows (see also the summary in Table 1).</Paragraph>
<Paragraph position="2"> SumTime weather statements: These are brief statements which describe predicted precipitation and cloud over a forecast period. We will extract the texts (and the corresponding input data) from the existing SumTime corpus.</Paragraph>
<Paragraph position="3"> Statistics summaries: We will ask people (probably students) to write paragraph-length textual summaries of statistical data. The actual data will come from opinion polls or national statistics offices. The corpus will also include data about the authors (e.g., age, sex, domain expertise).</Paragraph>
<Paragraph position="4"> Nurses' reports: As part of a new project at Aberdeen, Babytalk, we will be acquiring a corpus of texts written by nurses to summarise the status of a baby in a neonatal intensive care unit, along with the raw data this is based on (sensor readings, records of actions taken such as giving medication).</Paragraph>
</Section>
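To make the planned data format concrete, here is a minimal sketch of how a single weather-statement record might pair the numerical input data, an intermediate content representation, and the corpus text. The field names and example values are illustrative assumptions only, not a format the project has committed to.

```python
# Illustrative sketch only: the field names and example values below are
# assumptions about how an input/output record for the weather-statement
# task might be organised; the actual GENEVAL formats are not yet fixed.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class WeatherRecord:
    forecast_id: str                     # identifier linking back to the source corpus
    input_data: List[Dict[str, float]]   # time-stamped numerical predictions (system input)
    content_repr: List[Dict[str, str]]   # intermediate content representation (what to say)
    reference_text: str                  # human-written forecast text (held out for evaluation)

example = WeatherRecord(
    forecast_id="sumtime-0001",
    input_data=[
        {"hour": 6.0, "precip_prob": 0.1, "cloud_cover": 0.3},
        {"hour": 12.0, "precip_prob": 0.7, "cloud_cover": 0.9},
    ],
    content_repr=[
        {"event": "cloud", "change": "increasing"},
        {"event": "rain", "time": "afternoon", "certainty": "likely"},
    ],
    reference_text="Cloud increasing, with rain likely by the afternoon.",
)

if __name__ == "__main__":
    # A component-only participant would consume just the fields relevant to
    # its stage, e.g. a realiser would start from content_repr.
    print(example.forecast_id, "->", example.reference_text)
```

Under this (hypothetical) layout, a document-planning component would map input_data to content_repr, while a realiser would map content_repr to text, which is what allows component-level rather than end-to-end participation.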
<Section position="2" start_page="136" end_page="136" type="sub_section">
<SectionTitle> 3.2 Step 2: Organise workshops </SectionTitle>
<Paragraph position="0"> The second step is to organise workshops. We intend to use a fairly standard organisation (Belz and Kilgarriff, 2006). We will release the data sets (but not the reference texts), give people six months to develop systems, and invite people who submit systems to a workshop. Participants can submit either complete data-to-text NLG systems, or components which just do document planning, microplanning, or realisation.</Paragraph>
<Paragraph position="1"> We are planning to increase the number and complexity of tasks from one round to the next, as this has been useful in other NLP evaluations (Belz and Kilgarriff, 2006); for example, we will add surface realisation as a separate task in round 2 and a layout/structuring task in round 3.</Paragraph>
<Paragraph position="2"> We will carry out all evaluation activities (see below) ourselves; workshop participants will not be involved in this.</Paragraph>
</Section>

Table 1: Summary of the planned data sets.
Corpus                  num texts   num ref (*)   text size       main NLG challenges
Weather statements      3000        300           1-2 sentences   content determination, lexical choice, aggregation
Statistical summaries   1000        100           paragraph       above, plus surface realisation
Nurses' reports         200         50            several paras   above, plus text structuring/layout
(*) In addition to the main corpus, we will also gather texts to be used as reference texts for corpus-based evaluations; 'num ref' is the number of such texts. These texts will not be released.

<Section position="3" start_page="136" end_page="137" type="sub_section">
<SectionTitle> 3.3 Step 3: Evaluation </SectionTitle>
<Paragraph position="0"> The final step is to evaluate the systems and components submitted to the workshop. As the main purpose of this whole exercise is to see how well different evaluation techniques correlate with each other, we plan to carry out a range of different evaluations, including the following.</Paragraph>
<Paragraph position="1"> Corpus-based evaluations: We will develop new, linguistically grounded evaluation metrics, and compare these to existing metrics including BLEU, NIST, and string-edit distance. We will also investigate how sensitive different metrics are to the size and make-up of the reference corpus.</Paragraph>
<Paragraph position="2"> Human-based preference judgements: We will investigate different experimental designs and methods for overcoming respondent bias (e.g., what is known as 'central tendency bias', where some respondents avoid judgements at either end of a scale). As we have previously shown that there are significant inter-subject differences in ratings (Belz and Reiter, 2006), one thing we want to determine is how many subjects are needed to get reliable and reproducible results.</Paragraph>
<Paragraph position="3"> Task performance: This depends on the domain, but, for example, in the nurse-report domain we could use the methodology of Law et al. (2005), who showed medical professionals the texts, asked them to make a treatment decision, and then rated the correctness of the suggested treatments.</Paragraph>
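Since the central question is how well different evaluation techniques agree with one another, the following minimal sketch illustrates the kind of analysis involved: a toy corpus-based metric (normalised string-edit distance against a reference text) is computed for a few hypothetical system outputs and rank-correlated with hypothetical human ratings. The metric, the data, and the hand-rolled Spearman routine are illustrative assumptions, not the project's actual evaluation code.

```python
# Illustrative sketch: a toy comparison of one corpus-based metric
# (normalised string-edit distance to a reference text) against hypothetical
# human ratings, using a hand-rolled Spearman rank correlation.
# None of the data or scores come from the GENEVAL proposal itself.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two token sequences."""
    ta, tb = a.split(), b.split()
    prev = list(range(len(tb) + 1))
    for i, x in enumerate(ta, 1):
        curr = [i]
        for j, y in enumerate(tb, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def metric_score(output: str, reference: str) -> float:
    """Similarity in [0, 1]: 1.0 means identical to the reference."""
    dist = edit_distance(output, reference)
    return 1.0 - dist / max(len(output.split()), len(reference.split()))

def spearman(xs, ys) -> float:
    """Spearman's rho via Pearson correlation of the ranks (ties not handled)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order, 1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var

if __name__ == "__main__":
    reference = "cloud increasing with rain likely by the afternoon"
    system_outputs = {  # hypothetical submissions
        "system_a": "rain likely in the afternoon with increasing cloud",
        "system_b": "cloud increasing with rain likely by the afternoon",
        "system_c": "sunny spells throughout the day",
    }
    human_ratings = {"system_a": 4.1, "system_b": 4.8, "system_c": 1.5}  # hypothetical

    names = sorted(system_outputs)
    metric = [metric_score(system_outputs[n], reference) for n in names]
    human = [human_ratings[n] for n in names]
    print("Spearman correlation between metric and human ratings:",
          round(spearman(metric, human), 3))
```

In the project itself, the same kind of rank correlation would be computed between each candidate metric (BLEU, NIST, string-edit distance, and the new linguistically grounded metrics) and the human-based and task-performance results.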
<Paragraph position="4"> As well as recommendations about the appropriateness of existing evaluation techniques, we hope the above experiments will allow us to suggest new evaluation techniques for NLG.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="137" end_page="137" type="metho">
<SectionTitle> 4 Next Steps </SectionTitle>
<Paragraph position="0"> At this point, we encourage NLG researchers to give us their views on our plans for the organisation of GENEVAL and on the data and evaluation methods we are planning to use, to suggest additional data sets or evaluation techniques, and especially to let us know whether they would be interested in participating.</Paragraph>
<Paragraph position="1"> If our proposal is successful, we hope that the project will start in summer 2007, with the first data set released in late 2007 and the first workshop held in summer 2008. ELRA/ELDA have already agreed to help us with this work, contributing human and data resources.</Paragraph>
</Section>
</Paper>