<?xml version="1.0" standalone="yes"?> <Paper uid="E89-1016"> <Title>User studies and the design of Natural Language Systems</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> I.I.I Test suites </SectionTitle> <Paragraph position="0"> One method that is often used in computer science for the evaluation of systems is the use of test suites. For NL systems the idea is to generate a corpus of sentences which contains the major set of syntactic, se- 116mantic and pragmatic phenomena the system should cover \[BB84, FNSW87\]. One problem with this approach is how we determine whether the test set is complete. Do we have a clear notion of what constitute the major phenomena of language so that we can generate test sentences which identify whether these have been analysed correctly? Theories of syntax are well developed and may provide us with a good taxonomy of syntactic phenomena, but we do not have similar classifications of key pragmatic requirements.</Paragraph> <Paragraph position="1"> There are two reasons why current approaches may fail to identify the key phenomena. Current test sets are organised on a single-utterance basis, with certain exceptions such as intersentential anaphora and ellipsis. Now it may be that more complex discourse phenomena such as reference to dialogue structure arise when systems are being used to carry out tasks, because of the need to construct and manipulate sets of information \[McK84\]. In addition, context may contribute to inputs being fragmentary or telegraphic in style. Unless we investigate systems being used to carry out tasks, such phenomena will continue to he omitted from our test suites and NL systems will have to be substantially modified when they are connected to their backend systems. Thus we are not arguing against the use of test suites in principle but rather are attempting to determine what methodology should be used to design such test suites.</Paragraph> <Paragraph position="2"> In field studies, subjects are given the NL interface connected to some application and encouraged to make use of it. It would seem that these studies would offer vital information about target requirements. Despite arguments that such studies are highly necessary \[Ten79\], few systematic studies have been conducted \[Dam81, JTS*85, Kra80\]. The problem here may be with finding committed users who are prepared to make serious use of a fragile system.</Paragraph> <Paragraph position="3"> A major problem with such studies concerns the robustness of the systems which were tested and this leads to difficulties in the interpretation of the results. This is because a fragile system necessarily imposes limitations on the ways that a user can interact with it. We cannot therefore infer that the set of sentences that users input when they have adjusted to a fragile system, reflects the set of inputs that they would wish to enter given a system with fewer limitations.</Paragraph> <Paragraph position="4"> In other words we cannot infer that such inputs represent the way that users would ideally wish to interact using NL. The users may well have been employing strategies to communicate within the limitations of the system and they may therefore have been using a highly restricted form of English. 
Indeed, the existence of strategies such as paraphrasing and syntax simplification when a query failed, and repetition of input syntax when a query succeeded, has been documented \[Tho80, WW89\].</Paragraph>
<Paragraph position="5"> Since we cannot currently envisage a system without limitations, we may want to exploit this ability to learn system limitations; nevertheless, the existence of such user strategies does not give us a clear view of what language might have been used in the absence of these limitations.</Paragraph>
<Paragraph position="6"> 1.1.3 Pen and paper tasks One technique which overcomes some of the problems of robustness has been to use pen and paper tasks. Here we do not use a system at all, but rather give subjects what is essentially a translation task \[JTS*85, Mil81\]. This technique has also been employed to evaluate formal query languages such as SQL. The subjects of the study are given a sample task: &quot;A list of alumni in the state of California has been requested. The request applies to those alumni whose last name starts with an S. Obtain such a list containing last names and first names.&quot; When the subjects have generated their natural language query, it is evaluated by judges to determine whether it would have successfully elicited the information from the system.</Paragraph>
<Paragraph position="7"> This approach avoids the problem of using fragile systems, but it is susceptible to the same objections as were levelled at test suites: a potential drawback with the approach concerns the representativeness of the set of tasks the users are required to do when they carry out the translation tasks. For the tasks described by Reisner, for example, the queries are all one-shot, i.e. they are attempts to complete a task in a single query \[Rei77\]. As a result the translation problems may fail to test the system's coverage of discourse phenomena.</Paragraph>
<Paragraph position="8"> A similar technique to pen and paper tasks has been the use of a method called the &quot;Wizard of Oz&quot; (henceforth WOZ), which also avoids the problem of the fragility of current systems by simulating the operation of the system rather than using the system itself. In these studies, subjects are told that they are interacting with the computer when in reality they are linked to the Wizard, a person simulating the operation of the system, over a computer network.</Paragraph>
<Paragraph position="9"> In Guindon's study using the WOZ technique, subjects were told they were using an NL front-end to a knowledge-based statistics advisory package \[GSBC86\]. The main result is a counterintuitive one.</Paragraph>
<Paragraph position="10"> These studies suggest that people produce &quot;simple language&quot; when they believe that they are using an NL interface. Guindon compared the WOZ dialogues of users interacting with the statistics package to informal speech, and likened them to the simplified register of &quot;baby talk&quot; \[SF77\].
In comparison with informal speech, the dialogues have few passives, few pronouns and few examples of fragmentary speech.</Paragraph>
<Paragraph position="11"> One problem with the research is that it has been descriptive: it has chiefly been concerned with demonstrating the fact that the language observed is &quot;simple&quot; relative to norms gathered for informal and written speech, and the results are expressed at too general a level to be useful for system design.</Paragraph>
<Paragraph position="12"> It is not enough to know, for example, that there are fewer fragments observed in WOZ-type dialogues than in informal speech: it is necessary to know the precise characteristics of such fragments if we are to design a system to analyse these when they occur.</Paragraph>
<Paragraph position="13"> Despite this, our view is that WOZ represents the most promising technique for identifying the target requirements of an NL interface. However, to avoid the problem of precision described above, we modified the technique in one significant respect. Having used the WOZ technique to generate a set of sentences that users ideally require to carry out a database retrieval task, we then input these sentences into an NL system linked to the database. The target requirements are therefore evaluated against a version of a real system, and we can observe the ways in which the system satisfies, or fails to satisfy, user requirements.</Paragraph>
<Paragraph position="14"> We discuss semantics and pragmatics only insofar as they are reflected in individual lexical items. This is of some importance, given the lexical basis of the HPNL system. It must also be noted that the evaluation took place against a prototype version of HPNL. Many of the lexical errors we encountered could be removed with a trivial amount of effort. Our interest was therefore not in the absolute number of such errors, but rather in the general classes of lexical errors which arose. We present a classification of such errors below.</Paragraph>
<Paragraph position="15"> The task we investigated was database retrieval.</Paragraph>
<Paragraph position="16"> This was predominantly because it has been a typical application for NL interfaces. Our initial interest was in the target requirements for an NL system, i.e. what set of sentences users would enter if they were given no constraints on the types of sentences that they could input. The Wizard was therefore instructed to answer all questions (subject to the limitation given below). We ensured that this person had sufficient information to answer questions about the database, and so, in principle, the system was capable of handling all inputs.</Paragraph>
<Paragraph position="17"> The subjects were asked to access information from the &quot;database&quot; about a set of paintings which possessed certain characteristics. The database contained information about Van Gogh's paintings, including their age, theme, medium, and location. The subjects had to find a set of paintings which together satisfied a series of requirements, and they did this by typing English sentences into the machine.
They were not told exactly what information the database contained, nor about the set of inputs the Natural Language interface might be capable of processing.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 1.2 The current study / 2 Method </SectionTitle>
<Paragraph position="0"> The current study therefore has two components: the first is a WOZ study of dialogues involved in database retrieval tasks. We then take the recorded dialogues and map them onto the capabilities of an existing system, HPNL \[NP88\], to look at where the language that the users produce goes beyond the capabilities of this system. The results we present concern the first phase of such an analysis, in which we discuss the set of words that the system failed to analyse.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Subjects </SectionTitle>
<Paragraph position="0"> The 12 subjects were all familiar with using computers insofar as they had used word processors and electronic mail. A further 5 of them had used office applications such as spreadsheets or graphics packages. Of the remainder, 4 had some experience with using databases, and one of these had participated in the design of a database. None of them was familiar with the current state of NL technology.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Procedure </SectionTitle>
<Paragraph position="0"> The experimenter told the subjects that he was interested in evaluating the efficiency of English as a medium for communicating with computers. He told them that an English interface to a database was running on the machine and that the database contained information about paintings by Van Gogh and other artists. In fact this was not true: the information that the subjects typed into the terminal was transmitted to a person (the Wizard) at another terminal, who answered the subjects' requests by consulting paper copies of the database tables.</Paragraph>
<Paragraph position="1"> The experimenter then gave the details of the two tasks. Subjects were told that they had to find a set of paintings which satisfied several requirements, where a requirement might be, for example, that (a) all the paintings must come from different cities, or (b) they must all have different themes. Having found this set, they then had to access particular information about the set of pictures that they had chosen, e.g. the paint medium for each of the pictures chosen.</Paragraph> </Section> </Section> </Paper>