<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1018">
  <Title>TOWARDS BETTER NLP SYSTEM EVALUATION</Title>
  <Section position="9" start_page="106" end_page="106" type="concl">
    <SectionTitle>
8. Conclusion
</SectionTitle>
    <Paragraph position="0"> As noted, while there have been individual evaluations of interest, and attacks on methodology (cf Thompson, 1992), the (D)ARPA/NIST evaluation initiatives are major efforts in terms of their goals, scale, and hard labour. Relating these to my analysis and methodology, their major feature is that they are laboratory experiments, and as such are naturally distanced from detailed operational influences. Moreover, the desire for control, to be achieved not only by blind testing but more materially via highly elaborated and polished answer data, further emphasises their detachment. This is particularly noticeable in the MUC and SLS (ATIS) cases, where the assessment data is not realistic or representative in any strong, or at any rate demonstrated, sense. There is an assumption that the nature of the evaluation data reflects real needs, and that the relative scores obtained correctly predict relative operational utility. While SLS assessment via logfiles refers more to systems in use, the current MUC evaluation concern with 'internal' NLP products, e.g. predicate-argument structures, reflects researchers' interests in fine-grained explanatory evaluation concentrating on system parameters, properly viewed heuristically, not absolutely. These are legitimate interests, but there is not enough of the necessary complementary concern, even for the laboratory approach, with environment variables and their interaction with parameters and impact on performance. TREC has the advantage of more realistic (and also more simply specified) evaluation, but there are still concerns here about legitimacy and generalisation; these instructively include what operationally relevant inferences can be drawn from test results even when these use such intuitively plausible and universal measures as recall and precision.</Paragraph>
    <Paragraph position="1"> One of the main requirements for future NLP evaluations is thus to approach these in a comprehensive as well as systematic way, so that the specific tests done are properly situated, especially in relation to the ends the evaluation subject is intended to serve, and the properties of the context in which it does this.</Paragraph>
  </Section>
</Paper>