WORKSHOP ON THE EVALUATION OF NATURAL LANGUAGE PROCESSING SYSTEMS

1 INTRODUCTION

In the past few years, the computational linguistics research community has begun to wrestle with the problem of how to evaluate its progress in developing natural language processing systems. With the exception of natural language interfaces, there are few working systems in existence, and they tend to focus on very different tasks using equally different techniques. There has been little agreement in the field about training sets and test sets, or about clearly defined subsets of problems that constitute standards for different levels of performance. Even those groups that have attempted a measure of self-evaluation have often been reduced to discussing a system's performance in isolation, comparing its current performance to its previous performance rather than to that of another system. As this technology begins to move slowly into the marketplace, the lack of useful evaluation techniques is becoming more and more painfully obvious.

In order to make progress in the difficult area of natural language evaluation, a Workshop on the Evaluation of Natural Language Processing Systems was held on December 7-9, 1988, at the Wayne Hotel in Wayne, Pennsylvania.

The workshop was organized by Martha Palmer (Unisys), assisted by a program committee consisting of Beth Sundheim (NOSC), Ed Hovy (ISI), Tim Finin (Unisys), Lynn Bates (BBN), and Mitch Marcus (Pennsylvania). Approximately 50 people participated, drawn from universities, industry, and government. The workshop received the generous support of the Rome Air Development Center, the Association for Computational Linguistics, the American Association for Artificial Intelligence, and Unisys Defense Systems. The workshop was organized around two basic premises.

First, it should be possible to discuss system evaluation in general without having to state whether the purpose of the system is "question-answering" or "text processing." Evaluating a system requires the definition of an application task in terms of input/output pairs that are equally applicable to question-answering, text processing, or generation. Second, there are two basic types of evaluation: black-box evaluation, which measures system performance on a given task in terms of well-defined input/output pairs, and glass-box evaluation, which examines the internal workings of the system. For example, glass-box performance evaluation for a system that is supposed to perform semantic and pragmatic analysis should include the examination of predicate-argument relations, referents, and temporal and causal relations. Since there are many different stages of development that a natural language system passes through before it is in a state where black-box evaluation is even possible (see Figure 1), glass-box evaluation plays an especially important role in guiding development at the early stages.
With these premises in mind, the workshop was structured around the following three sessions: (i) defining the notions of "black-box evaluation" and "glass-box evaluation" and exploring their utility; (ii) defining criteria for "black-box evaluation"; and (iii) defining criteria for "glass-box evaluation." It was hoped that the workshop would shed light on the following questions.