<?xml version="1.0" standalone="yes"?>
<Paper uid="J90-3005">
  <Title>WORKSHOP ON THE EVALUATION OF NATURAL LANGUAGE PROCESSING SYSTEMS</Title>
  <Section position="9" start_page="0" end_page="0" type="concl">
    <SectionTitle>
6 WORKSHOP CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> Several concerete results came out of the workshop. In particular, a consensus was reached on the black-box evaluation task for the second Message Understanding Conference (MUCK II), and a consensus was also reached on the desirability of a common corpus of annotated language, both written and spoken, that could be used for training and testing purposes. Since the workshop, MUCK II has been held with interesting and useful results, and the Treebank project at the University of Pennsylvania has received funding and has begun. This should eventually lead to a more formalized testing and comparisons of parsers. Evaluation is becoming a more prevalent topic at NL workshops, such as the one held at RADC in September 1989, and the Darpa Spoken Language Community is working hard to construct a general evaluation procedure for the various contractors. However, most of the other specific workshops suggested, such as Database Question-Answering, Generation, Knowledge Representation, and Pragmatics and Discourse do not have any funding sources available. The most difficult problems remain unresolved.</Paragraph>
    <Paragraph position="1"> There ,are still large classes of phenomena that have yet to be characterized in a scholarly fashion, and we do not have adequa.te methods for measuring progress of a system under development.</Paragraph>
    <Paragraph position="2"> A fundamental underlying snag is the difficulty in arriving al; a consensus on the nature of semantic representation. If the community was in agreement on what the representation of a sentence is supposed to be--whether it was a sentence from a dialogue with an expert system, a sentence fragment from a tactical message, or a database query-then the task of assessing a system's performance would be much more straightforward. Given input X, does the system produce Y as an internal data structure? Unfortunately, there are now as many Y's for X as there are systems, so finding a reliable method of assessing a system in isolation, or of comparing two systems, becomes much more clifficult. It is necessary to define the evaluation in term:; of a task that is being performed (Sundheim 1989; Napier 1989). Then the system's score with respect to natural language competence becomes dependent on how well 'the system as a whole can elicit information from the expert system or the database, or can summarize the information in the message. Task-oriented black-box evaluations are useful and valid, and are certainly of primary concern to the end users who need the information, and do not rea.lly care how it is produced. But there are drawbacks in depending solely on this approach. A system's capabilities cannot be measured or compared until it has been completely integrated with a target application. For any interesting application, this requires a major investment in a dornain model and in a domain semantics, not to mention all of tlhe application-specific needs around user friendliness and informative displays of information, etc. Designing the task-oriented test can require a major investment as well (Sundlheim 1989). This is an extremely expensive and time--consuming enterprise that few organizations can indulge Jln. The result is that there are very few systems that are fully integrated with target applications in such a way that an appropriate task-oriented evaluation can be performed. There is no way to test whether or not a system is suitab'~te for a particular application without actually building the application. There are no accepted guidelines that system developers can use to measure the progress being made by a fledgling system from month to month. Granted that a task-oriented evaluation is necessary and sufficient for a system that is ready for end-users, it does not solve the problem of charting a system's progress along the way toward a particular application.</Paragraph>
  </Section>
class="xml-element"></Paper>