<?xml version="1.0" standalone="yes"?>
<Paper uid="J90-3005">
  <Title>WORKSHOP ON THE EVALUATION OF NATURAL LANGUAGE PROCESSING SYSTEMS</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2.1 BLACK-BOX EVALUATION
</SectionTitle>
    <Paragraph position="0"> Black-box evaluation is primarily focused on &amp;quot;what a system does.&amp;quot; Ideally, it should be possible to measure performance based on well-defined I/0 pairs. If accurate output is produced with respect to particular input, then the system is performing correctly. In practice, this is more difficult than it appears. There is no consensus on how to evaluate the correctness of semantic representations, so output has to be in terms of some specific application task such as databzse answering or template fill (Sundheim 1989). This allows for an astonishing amount of variation between systems, and makes it difficult to separate out issues of coverage of linguistic phenomena from robustness and error' recovery (see Figure 2).</Paragraph>
    <Paragraph position="1"> In addition to the accuracy of the output, systems could also be evaluated in terms of their user-friendliness, modularity, portability, and maintainability. How easy are they to us.e, how well do they plug into other components, can they be ported and maintained by someone who is not a system expert? In general, it should be possible to perform  176 Computational Linguistics Volume 16, Number 3, September 1990 Martha Palmer and Tim Finin Natural Language Processing Systems Figure 2 A &amp;quot;black-box evaluation&amp;quot; is primarily focused on &amp;quot;what a system does.&amp;quot; It attempts to measure the system performance on a given task in terms of well-defined input/output pairs.</Paragraph>
    <Paragraph position="2"> black box evaluation without knowing anything about the inner workings of the system--the system can be seen as a black box, and can be evaluated by system users.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 GLASS-BOX EVALUATION
</SectionTitle>
      <Paragraph position="0"> In contrast, glass-box evaluation attempts to look inside the system, and find ways of measuring how well the system does something, rather than simply whether or not it does it. Glass-box evaluation should measure the system's coverage of a particular linguistic phenomenon or set of phenomena and the data structures used to represent them (see Figure 3). And it also should be concerned with the efficiency of the algorithms being used. Many of these tests could be performed only by a system builder. They are especially useful in attempting to measure progress when a system is under development. Glass-box evaluation should also include an examination of relevant linguistic theories and how faithfully they are implemented. If a linguistic theory does not deal with all of the data and has to be modified by the developer, those modifications need be clearly documented, and the information relayed to the Figure 3 A &amp;quot;glass-box evaluation&amp;quot; addresses &amp;quot;how the system works&amp;quot;. It attempts to look inside the system and find ways of measuring how well the system does something, rather than simply whether or not it does it.</Paragraph>
      <Paragraph position="1"> theory's developer. For example, as pointed out in Bonnie Webber's (Penn) presentation, there is a distinction between Tree Adjunction Grammar (TAG) as a linguistic theory and the several algorithms that have been used to implement TAG parsers: Extended CKY parser, Extended Earley parser, Two-pass extended Earley parser based on lexicalized TAGs, and a DCG parser using lexicalized TAGs. There is also a distinction between Centering as a theory for resolving anaphoric pronouns (Joshi and Weinstein 1981; Gross et al. 1983), and the attempts to use a centering approach to resolving pronouns in an implementation (Brennan et al. 1987).</Paragraph>
      <Paragraph position="2"> In addition, one way of looking inside a system is to look at the performance of one or more modules or components.</Paragraph>
      <Paragraph position="3"> Which components are obtained depends on the nature of the decomposition of the system. NL systems are commonly decomposed into functional modules (e.g., parser, semantic interpretation, lexical lookup, etc.), each of which performs a specified task, and into analysis phases during which different functions can be performed. A black-box evaluation of a particular component's performance could be seen as a form of glass-box evaluation. For example, taking a component such as a parser and defining a test that depends on associating specified outputs for specified inputs would be a black-box evaluation of the parser. Since it is an evaluation of a component that cannot by itself perform an application, and since it will give information about the component's coverage that is independent of the coverage of any system in which it might be embedded, this can be seen as providing glass-box information for such an overall system.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 WORKSHOP FORMAT
</SectionTitle>
    <Paragraph position="0"> The workshop began with a set of presentations that dis- null cussed evaluation methods from related fields: speech processing (Dave Pallet--NIST), machine translation (Jonathan Slocum--Symantec), and information retrieval (Dave Lewis--UMASS). This was followed by a panel on reports on evaluations of natural language systems chaired by Lynn Bates (BBN), and including John Nerbonne (HP), Debbie Dahl (Unisys), Anatole Gershman (Cognitive Systems, Inc), and Dick Kitteridge (Odyssey Re null search, Inc). After lunch Beth Sundheim (NOSC) presented the workshop with the task for the afternoon working groups, which was to discuss black-box evaluation methodologies. The groups consisted of Message Understanding chaired by Ralph Grishman (NYU), Text Understanding chaired by Lynette Hirschman (Unisys), Database Question-Answering chaired by Harry Tennant (TI), Dialogue Understanding chaired by Mitch Marcus (Penn), and Generation chaired by Ed Hovy. After reporting on the results of the working groups, the workshop met for a banquet, which included a demonstration of the Dragon speech recognition system by Jim Baker. The second day began with another presentation of a method of black-box evaluation applied to syntactic parsers, (i.e., glass-box evaluation Computational Linguistics Volume 16, Number 3, September 1990 177 Martha Palmer and Tim Finin Natural Language Processing Systems with respect to an entire system), by Fred Jelinek (IBM), and then moved on to an introduction of the topic of glass box evaluation by Martha Palmer (Unisys). A panel chaired by Jerry Hobbs (SRI), which included Mitch Marcus (Penn) and Ralph Weischedel (BBN), discussed necessary characteristics for corpora to serve as training sets and test sets for black-box evaluation of systems and components of systems. The workshop then broke up into a new set of working groups to discuss a glass-box evaluation task introduced by Bonnie Webber (Penn): syntax, chaired by Dick Kitteredge (Odyssey Research), semantics, chaired by Christine Montgomery (Language Systems, Inc.), pragmatics and discourse, chaired by Candy Sidner (DEC), knowledge representation frameworks, chaired by Tim Finin, and systems, chaired by Lynn Bates. The final session was devoted to reports of the working groups and summarization of results.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 BLACK-BOX EVALUATION
</SectionTitle>
    <Paragraph position="0"> Beth Sundheim (NOSC) proposed a black box evaluation of message understanding systems consisting of a training set of 100 messages from a specific domain, and two separate test sets, one consisting of 20 messages and another of 10. The performance was to be evaluated with respect to a frame-filling task. There was general agreement among the workshop participants that useful black-box evaluations can be done for the message understanding and database question-answering task domains. It was also agreed that more general systems aimed at text understanding and dialogue understanding were not good candidates for black-box evaluation due to the nascent stage of their development, although individual components from such systems might benefit from evaluation. The workshop attendees were pleasantly surprised by the results of the generation group, which came up with a fairly concrete plan for comparing performance of generation systems, based on the message understanding proposal. A perennial problem with all of these proposals, with the exception of the message understanding proposal, is the lack of funding.</Paragraph>
    <Paragraph position="1"> Conferences and workshops need to be organized, systems need to be ported to the same domain so that they can be compared, etc., and there is very little financial support to make these things possible.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 MESSAGE UNDERSTANDING CONFERENCE II
</SectionTitle>
      <Paragraph position="0"> Beth Sundheim's black-box evaluation was in fact carried out last summer, June 1989, at MUCK II (Message Understanding Conference II) with quite interesting results (Sundheim 1989).</Paragraph>
      <Paragraph position="1"> It quickly became clear how important it was for systems to be able to handle partial input, a characteristic normally associated with usability. A system that could only handle 60 percent of the linguistic phenomena, but could do that in a robust fashion could receive a higher accuracy rating than a system that was capable of handling 80 percent of the linguistic phenomena, but only under ideal circumstances. The overall system performance, including many features 'that are not directly related to natural language processing, was a more important factor in the scoring than the system's linguistic coverage. Since these tests are intended 12o compare mature systems that are ready for end users, this is entirely appropriate, and is exactly what the end u,;ers are interested in. They are not concerned with how the system arrives at an answer, but simply with the answer !itself. However, tests that could provide more information about how a system achieved its results could be of more real utility to the developers of natural language systems.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 GLASS-BOX EVALUATION
</SectionTitle>
    <Paragraph position="0"> One of the primary goals of glass-box evaluations should be providing guidance to system developers--pinpointing gaps in coverage and imperfections in algorithms. The glass-box evaluation task for the workshop, as outlined by Bonnie Webber (Penn), consisted of several stages. The first stage was to define for each area a range of items that should be evaluated. The next stage was to determine which items in the ranlge were suitable for evaluation and which were not.</Paragraph>
    <Paragraph position="1"> For those that could be evaluated, appropriate methodologies (features and behaviors) and metrics (measures made on those features and behaviors) were to be specified. For items o:r areas that were not yet ready, there should be an attempt to specify the necessary steps for improving their suitability for evaluation.</Paragraph>
    <Paragraph position="2"> As explained in more detail below, the glass-box methodology most commonly suggested by the working groups was black-box evaluation of a single component. The area that seemed the ripest for evaluation was syntax, with semantics being the farthest away from the level of consensus required for general evaluation standards. Pragmatics and discourse heroically managed to specify a range of items and suggest a possible black-box evaluation methodology for a subset of those items. Knowledge representation specified subtopics with associated evaluation techniques.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 SYNTAX
</SectionTitle>
      <Paragraph position="0"> The most clearly defined methodology belonged to the syntax group, and has since taken shape in the form of the Treebank project, which follows many of the guidelines originally suggested by Fred Jelinek. This project will be able to evaluate syntactic parsers by comparing their output with respect to previously determined correct parse inforrnation--a black-box evaluation of a single component, i.e., a parser. The project has recently been established at the University of Pennsylvania under Mitch Marcus and is funded by DARPA, General Electric, and the Air Force. The goal of the project is to collect a large amount of data, both written language and spoken language, which will be divided into training sets and test sets. It involves annotating the data with a polytheoretic syntactic structure. It has been agreed that the annotation includes lexical class labels, bracketing, predicate argument 178 Computational Linguistics Volume 16, Number 3, September 1990 Martha Palmer and Tim Finin Natural Language Processing Systems relationships, and possibly reconstruction of control relationships, wh-gaps, and conjunction scope. Eventually it would be desirable to include co-reference anaphora, prepositional phrase attachment, and comparatives, although it is not clear how to ensure consistent annotation. People interested in testing the parsing capability of their systems against untried test data could deliver the parsers to the test site with the ability to map their output into the form of the corpus annotation for automatic testing. The test results can be returned to parser developers with overall scores as well as scores broken out by case, i.e., percentage of prepositional phrase bracketings that are correct.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 SEMANTICS
</SectionTitle>
      <Paragraph position="0"> One of the special difficulties in attempting to develop glass-box evaluation techniques is the lack of agreement over the content of semantic representations. Most people will agree that predicate argument relations, temporal relations, and modifiers (including prepositional phrase attachment) count as semantic phenomena, but will not agree on instructions for annotation or methodologies for evaluation. This is partly because semantics draws on so many diverse areas. People who are primarily interested in underlying cognitive structures have been accused of ignoring relations to surface syntactic phenomena. Logicians who are concentrating on building formal tools have been accused of ignoring lexical and cognitive issues, and people concerned with lexical semantics have been accused of ignoring everything but the dictionary. Some day this will all be brought together in peace and harmony, but meanwhile there are as many different styles of semantic representation as there are researchers in the field. The only possible form of comparative evaluation must be taskrelated. Good performance on such a task might be due to all sorts of factors besides the quality of the semantic representations, so it is not really an adequate discriminator. null In the Darpa Spoken Language Workshop in February 1989, Martha Palmer suggested three likely steps toward achieving more of a consensus on semantic representations:  1. Agreement on characterization of phenomena.</Paragraph>
      <Paragraph position="1"> 2. Agreement on mappings from one style of semantic representation to another.</Paragraph>
      <Paragraph position="2"> 3. Agreement on content of representations for a common  domain.</Paragraph>
      <Paragraph position="3"> An obvious choice for a common domain would be one of the MUCK domains, such as the OPREPS domain recently used for MUCK II. There are several state-of-the-art systems that are performing the same task for the same domain using quite different semantic representations. It would be useful to take four of these systems, say NYU, SRI, Unisys, and GE, and compare a selected subset of their semantic representations in depth. It should be possible to define a mapping from one style of semantic representation to another and pinpoint the various strengths and weaknesses of the different approaches. Another potential choice of domain is the Airline Guide domain. The Airline Guide task is a spoken language interface to the Official Airline Guide, where users can ask the system about flights, air fares, and other types of information about air travel.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 PRAGMATICS AND DISCOURSE
</SectionTitle>
      <Paragraph position="0"> The group's basic premise was that they would need a large corpus annotated with discourse phenomena. This would allow them to evaluate the effect of individual components upon the system as a whole and upon other components, such as syntax and semantics. It would also allow an individual component's behavior to be observed. They listed the discourse phenomena shown below, with the ones for which precise annotation instructions could be given marked with a *. The others might take a bit more thought. It was agreed that the topics for a subsequent meeting would include experimenting with text annotations and designing training sets and test sets.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5.4 KNOWLEDGE REPRESENTATION
FRAMEWORKS
</SectionTitle>
    <Paragraph position="0"> This group began by pointing out that the knowledge representation and reasoning (KR&amp;R) services provided for natural language systems fall into two classes: (1) providing a meaning representation language (MRL); and (2) providing inferential services in support of syntactic, semantic, and pragmatic processing. The group noted that the MRL class should probably be broadened to include languages for representing dialogues, lexical items, etc. In addition, the group laid out a spectrum of activities, which are included in a KR&amp;R shown in Figure 4.</Paragraph>
    <Paragraph position="1"> The group suggested three evaluation methodologies.</Paragraph>
    <Paragraph position="2"> The first was aimed at evaluating a KR&amp;R system's suitability as a meaning representation language. One way to evaluate a potential MRL is to have a standard set of Computational Linguistics Volume 16, Number 3, September 1990 179 Martha Palmer and Tim Finin Natural Language Processing Systems * theory - lJ there an underlying theory which gives meaning to the xa&amp;R system? What is known about the expressiveness of the language and the computational complexity of its reasoning? * languages - How does the K~t&amp;It system function as u practical language for expressing knowledge? How easy or difficult is it to define certain concepts or relations or to specify compututions? * systems - KIt&amp;X systems are more than just nn implementation of sn underlying theory. They require good development environments: knowledge acquisition tools, debugging tools, interface technology, integration aids, etc. How extensive and good is this environment? * basic models - A KR&amp;R system often comes with some basic, domain-independent modules or models, such as temporal reasoning, spatial reasoning, naive physics, etc. Are such models available and, if they s~e, how extensive and detailed sac they? Figure 4 There are several dimensions along which a knowledge representation and reasoning system might be evaluated.</Paragraph>
    <Paragraph position="3"> natural language expressions to try to express in the MRL.</Paragraph>
    <Paragraph position="4"> This provides an evaluator with some idea of the expressiveness and conciseness of the KR&amp;R system as an MRL. A second evaluation methodology follows the &amp;quot;Consumer's Reports&amp;quot; paradigm and involves developing a checklist of features. An extensive list of KR&amp;R features could be developed for each of the dimensions given in Figure 4.</Paragraph>
    <Paragraph position="5"> Scoring how well KR&amp;R systems provide each of these features provides a way to compare different systems. The final evaluation technique is to hold a MUCK-like workshop aimed at evaluating the performance of the NLP system's underlying KR&amp;R system. The group outlined a proposal for organizing a workshop to do an evaluation of the KR&amp;R aspects of a natural language processing system based on the MUCK Workshop models.</Paragraph>
  </Section>
class="xml-element"></Paper>