File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/01/w01-0905_concl.xml
Size: 2,937 bytes
Last Modified: 2025-10-06 13:53:06
<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0905">
<Title>Two levels of evaluation in a complex NL system</Title>
<Section position="8" start_page="5" end_page="5" type="concl">
<SectionTitle>7 Conclusion and perspectives</SectionTitle>
<Paragraph position="0"> Each evaluation reflects a viewpoint, which underlies the criterion we use. In our case, the choice of criteria was guided by the existence of two main stages in the QA process, namely the selection of relevant documents and the selection of the answer among the sentences of the selected documents. Sometimes, such criteria concur in revealing the same positive or negative feature of the system. They can also yield a more precise assessment of the reasons behind these features, as was the case in our evaluation of the ranker. Moreover, when a system consists of several modules, their specific evaluations should rely on different criteria.</Paragraph>
<Paragraph position="1"> (Footnote) Among the 42 runs using the 250-byte limit submitted at TREC9-QA, only seven found the correct answer at rank 1, and 27 did not find it at all.</Paragraph>
<Paragraph position="2"> (Footnote) 22 runs out of 42 found the right answer at rank 1; only 9 were unable to find it.</Paragraph>
<Paragraph position="3"> This is particularly true in dialogue systems, where different kinds of processes cooperate. Since information retrieval is an interactive task, it seems natural to associate a dialogue component with it. Indeed, users tend to ask a question, evaluate the answer, and reformulate their question to make it more specific (or, conversely, more general, or quite different). A QA system is therefore a good application setting for a dialogue module.</Paragraph>
<Paragraph position="4"> Quantitative assessment of the QA system would be useful in assessing the dialogue system in this particular context. Such a global assessment would provide an objective judgement about whether the task (finding the answer) was achieved or not. Success in the task is a necessary component of the evaluation; nevertheless, it is only one part of it. Obviously, dialogue evaluation is also a matter of cost (time, number of exchanges) and of user-friendliness (cognitive ergonomics).</Paragraph>
<Paragraph position="5"> However, objectivity is almost impossible to attain in these domains. In a recent debate (LREC 2000), serious objections to the evaluation and validation of natural language tools were raised, e.g. by Sabah (2000). The main issue he raises concerns the great complexity of such systems. Still, we consider that, by going as far as possible in the experimental search for evaluation criteria, we also make a meaningful contribution to this debate. While it is true that complexity should never be ignored, we believe that, through successive cycles of approximate modelling and evaluation, we can capture some of it at each step of our systems' development.</Paragraph>
</Section>
</Paper>
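The two quantitative viewpoints invoked in the conclusion, rank-based success of the QA runs and task success weighed against dialogue cost, can be made concrete with a minimal Python sketch. This is not from the paper: the function names, the cost weights, and the assumption that the eight runs neither succeeding at rank 1 nor failing outright answered at rank 2 are all illustrative.

    # Minimal illustrative sketch; assumptions noted above, not from the paper.

    def rank1_success(ranks):
        """Fraction of runs placing the correct answer at rank 1.
        `ranks` holds the rank of the correct answer per run, None if not found."""
        return sum(1 for r in ranks if r == 1) / len(ranks)

    def dialogue_score(task_success, n_exchanges, elapsed_seconds,
                       w_exchange=0.05, w_time=0.001):
        """Toy composite: task success penalised by dialogue cost (exchanges, time).
        The weights are hypothetical and would need tuning against user judgements."""
        success = 1.0 if task_success else 0.0
        return success - w_exchange * n_exchanges - w_time * elapsed_seconds

    # Mirrors the first footnote: 7 runs at rank 1, 27 failures; the remaining
    # 8 runs are assumed (for illustration only) to answer at rank 2.
    ranks = [1] * 7 + [2] * 8 + [None] * 27
    print(f"rank-1 success: {rank1_success(ranks):.2%}")         # -> 16.67%
    print(f"dialogue score: {dialogue_score(True, 3, 40):.2f}")  # -> 0.81

Such a composite score only operationalises the paper's point that success is "just a part" of dialogue evaluation; user-friendliness would still require separate, largely subjective assessment.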