<?xml version="1.0" standalone="yes"?> <Paper uid="P89-1031"> <Title>EVALUATING DISCOURSE PROCESSING ALGORITHMS</Title> <Section position="3" start_page="0" end_page="251" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In the course of developing natural language interfaces, computational linguists are often in the position of evaluating different theoretical approaches to the analysis of natural language (NL). They might want to (a) evaluate and improve on a current system, (b) add a capability to a system that it didn't previously have, (c) combine modules from different systems.</Paragraph> <Paragraph position="1"> Consider the goal of adding a discourse component to a system, or evaluating and improving one that is already in place. A discourse module might combine theories on, e.g., centering or local focusing \[GJW83, Sid79\], global focus \[Gro77\], coherence relations\[Hob85\], event&quot; reference \[Web86\], intonational structure \[PH87\], system vs. user beliefs \[Po186\], plan or intent recognition or production \[(3o578, AP86, SIS1\], control\[WSSS\], or complex syntactic structures \[Pri85\]. How might one evaluate the relative contributions of each of these factors or compare two approaches to the same problem? In order to take steps towards establishing a methodology for doing this type of comparison, we conducted a case study. We attempt to evaluate two different approaches to anaphoric processing in discourse by comparing the accuracy and coverage of two published algorithms for finding the co-specifiers of pronouns in naturally occurring texts and dialogues\[Hob76b, BFP87\]. Thus there are two parts to this paper: we present the quantitative results of hand-simulating these algorithms (henceforth Hobbs algorithm and BFP algorithm), but this analysis naturally gives rise to both a qualitative evaluation and recommendations for performing such evaluations in general. We illustrate the general difficulties encountered with quantitative evaluation. These are problems with: (a) allowing for underlying assumptions, (b) determining how to handle underspecifications, and (c) evaluating the contribution of false positives and error chaining.</Paragraph> <Paragraph position="2"> Although both algorithms are part of theories of discourse that posit the interaction of the algorithm with an inference or intentional component, we will not use reasoning in tandem with the algorithm's operation. We have made this choice because we want to be able to analyse the performance of the algorithms across different domains. We focus on the linguistic basis of these approaches, using only selectional restrictions, so that our analysis is independent of the vagaries of a particular knowledge representation. Thus what we are evaluating is the extent to which these algorithms suffice to narrow the search of an inference component I. This analysis gives us l But note the definition of success in section 2.1.</Paragraph> <Paragraph position="3"> some indication of the contribution of syntactic constraints, task structure and global focus to anaphoric processing.</Paragraph> <Paragraph position="4"> The data on which we compare the algorithms are important if we are to evaluate claims of generality. If we look at types of NL input, one clear division is between textual and interactive input. 
<Paragraph position="3"> The data on which we compare the algorithms are important if we are to evaluate claims of generality. If we look at types of NL input, one clear division is between textual and interactive input. A related, though not identical, factor is whether the language being analysed is produced by more than one person, although this distinction may be conflated in textual material such as novels that contain reported conversations. Within two-person interactive dialogues, there is the task-oriented master-slave type, where all the expertise, and hence much of the initiative, rests with one person. In other two-person dialogues, both parties may contribute discourse entities to the conversation on a more equal basis. Other factors of interest are whether the dialogues are human-to-human or human-to-computer, as well as the modality of communication, e.g. spoken or typed, since some researchers have indicated that dialogues, and particularly uses of reference within them, vary along these dimensions \[Coh84, Tho80, GSBC86, DJ89, WS89\].</Paragraph> <Paragraph position="4"> We analyse the performance of the algorithms on three types of data. Two of the samples are those that Hobbs used when developing his algorithm: one is an excerpt from a novel and the other a sample of journalistic writing. The remaining sample is a set of 5 human-human, keyboard-mediated, task-oriented dialogues about the assembly of a plastic water pump \[Coh84\]. This covers only a subset of the above types. Obviously it would be instructive to conduct a similar analysis on other textual types.</Paragraph> </Section></Paper>