<?xml version="1.0" standalone="yes"?> <Paper uid="J96-2004"> <Title>Squibs and Discussions Assessing Agreement on Classification Tasks: The Kappa Statistic</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Computational linguistic and cognitive science work on discourse and dialogue relies on subjective judgments. For instance, much current research on discourse phenomena distinguishes between behaviors which tend to occur at or around discourse segment boundaries and those which do not (Passonneau and Litman 1993; Kowtko, Isard, and Doherty 1992; Litman and Hirschberg 1990; Cahn 1992). Although in some cases discourse segments are defined automatically (e.g., Rodrigues and Lopes' \[1992\] definition based on temporal relationships), more usually discourse segments are defined subjectively, based on the intentional structure of the discourse, and then other phenomena are related to them. At one time, it was considered sufficient when working with such judgments to show examples based on the authors' interpretation (paradigmatically, (Grosz and Sidner \[1986\], but also countless others). Research was judged according to whether or not the reader found the explanation plausible. Now, researchers are beginning to require evidence that people besides the authors themselves can understand, and reliably make, the judgments underlying the research. This is a reasonable requirement, because if researchers cannot even show that people can agree about the judgments on which their research is based, then there is no chance of replicating the research results. Unfortunately, as a field we have not yet come to agreement about how to show reliability of judgments. For instance, consider the following arguments for reliability. We have chosen these examples both for the clarity of their arguments and because, taken as a set, they introduce the full range of issues we wish to discuss.</Paragraph> <Paragraph position="1"> . Kowtko, Isard, and Doherty (1992; henceforth KID), in arguing that it is possible to mark conversational move boundaries, cite separately for each of three naive coders the ratio of the number of times they agreed with an &quot;expert&quot; coder about the existence of a boundary over the number of times either the naive coder or the expert marked a boundary.</Paragraph> <Paragraph position="2"> They do not describe any restrictions on possible boundary sites.</Paragraph> <Paragraph position="3"> Human Communication Research Centre, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland Computational Linguistics Volume 22, Number 2</Paragraph> <Paragraph position="5"> Once conversational move boundaries have been marked on a transcript, KID argue that naive coders can reliably place moves into one of thirteen exclusive categories. They cite pairwise agreement percentages figured over all thirteen categories, again looking at each of the three naive coders separately. Litman and Hirschberg (1990) use this same pairwise technique for assessing the reliability of cue phrase categorization, using two equal-status coders and three categories.</Paragraph> <Paragraph position="6"> Silverman et al. (1992), in arguing that sets of coders can agree on a range of category distinctions involved in the TOBI system for labeling English prosody, cite the ratio of observed agreements over possible agreements, measuring over all possible pairings of the coders. 
<Paragraph position="8"> Although (1) and KID's use of (2) differ slightly from Litman and Hirschberg's use of (2), (3), and (4) in clearly designating one coder as an &quot;expert,&quot; all of these studies have n coders place some kind of units into m exclusive categories. Note that the cases of testing for the existence of a boundary can be treated as coding &quot;yes&quot; and &quot;no&quot; categories for each of the possible boundary sites; this treatment is used by measures (3) and (4) but not by measure (1). All four approaches seem reasonable when taken at face value. However, the four measures of reliability bear no relationship to each other. Worse yet, since none of them takes into account the level of agreement one would expect coders to reach by chance, none of them is interpretable even on its own. We first explain what effect chance expected agreement has on each of these measures, and then argue that we should adopt the kappa statistic (Siegel and Castellan 1988) as a uniform measure of reliability.</Paragraph>
</Section> </Paper>