<?xml version="1.0" standalone="yes"?>
<Paper uid="J96-2004">
  <Title>Squibs and Discussions Assessing Agreement on Classification Tasks: The Kappa Statistic</Title>
  <Section position="3" start_page="0" end_page="251" type="metho">
    <SectionTitle>
2. Chance Expected Agreement
</SectionTitle>
    <Paragraph position="0"> Measure (2) seems a natural choice when there are two coders, and there are several possible extensions when there are more coders, including citing separate agreement figures for each important pairing (as KID do by designating an expert), counting a unit as agreed only if all coders agree on it, or measuring one agreement over all possible pairs of coders thrown in together. Taking just the two-coder case, the amount of agreement we would expect coders to reach by chance depends on the number and relative proportions of the categories used by the coders. For instance, consider what happens when the coders randomly place units into categories instead of using an established coding scheme. If there are two categories occurring in equal proportions, on average the coders would agree with each other half of the time: each time the second coder makes a choice, there is a fifty/fifty chance of coming up with the same</Paragraph>
    <Section position="1" start_page="250" end_page="251" type="sub_section">
      <SectionTitle>
Carletta Assessing Agreement
</SectionTitle>
      <Paragraph position="0"> category as the first coder. If, instead, the two coders were to use four categories in equal proportions, we would expect them to agree 25% of the time (since no matter what the first coder chooses, there is a 25% chance that the second coder will agree.) And if both coders were to use one of two categories, but use one of the categories 95% of the time, we would expect them to agree 90.5% of the time (.952 4- .052 , or, in words, 95% of the time the first coder chooses the first category, with a .95 chance of the second coder also choosing that category, and 5% of the time the first coder chooses the second category, with a .05 chance of the second coder also doing so).</Paragraph>
      <Paragraph position="1"> This makes it impossible to interpret raw agreement figures using measure (2). This same problem affects all of the possible ways of extending measure (2) to more than two coders.</Paragraph>
      <Paragraph position="2"> Now consider measure (3), which has an advantage over measure (2) when there is a pool of coders, none of whom should be distinguished, in that it produces one figure that sums reliability over all coder pairs. Measure (3) still falls foul of the same problem with expected chance agreement as measure (2) because it does not take into account the number of categories occurring in the coding scheme.</Paragraph>
      <Paragraph position="3"> Measure (4) is a different approach to measuring over multiple undifferentiated coders. Note that although Passonneau and Litman are looking at the presence or absence of discourse segment boundaries, measure (4) takes into account agreement that a prosodic phrase boundary is not a discourse segment boundary, and therefore treats the problem as a two-category distinction. Measure (4) falls foul of the same basic problem with chance agreement as measures (2) and (3), but in addition, the statistic itself guarantees at least 50% agreement by only pairing off coders against the majority opinion. It also introduces an &amp;quot;expert&amp;quot; coder by the back door in assuming that the majority is always right, although this stance is somewhat at odds with Passonneau and Litman's subsequent assessment of a boundary's strength, from one to seven, based on the number of coders who noticed it.</Paragraph>
      <Paragraph position="4"> Measure (1) looks at almost exactly the same type of problem as measure (4), the presence or absence of some kind of boundary. However, since one coder is explicitly designated as an &amp;quot;expert,&amp;quot; it does not treat the problem as a two-category distinction, but looks only at cases where either coder marked a boundary as present. Without knowing the density of conversational move boundaries in the corpus, this makes it difficult to assess how well the coders agreed on the absence of boundaries, or to compare measures (1) and (4). In addition, note that since false positives and missed negatives are rolled together in the denominator of the figure, measure (1) does not really distinguish expert and naive coder roles as much as it might. Nonetheless, this style of measure does have some advantages over measures (2), (3), and (4), since these measures produce artificially high agreement figures when one category of a set predominates, as is the case with boundary judgments. One would expect measure (1)'s results to be high under any circumstances, and it is not affected by the density of boundaries.</Paragraph>
      <Paragraph position="5"> So far, we have shown that all four of these measures produce figures that are at best, uninterpretable and at worst, misleading. KID make no comment about the meaning of their figures other than to say that the amount of agreement they show is reasonable; Silverman et al. simply point out that where figures are calculated over different numbers of categories, they are not comparable. On the other hand, Passonneau and Litman note that their figures are not properly interpretable and attempt to overcome this failing to some extent by showing that the agreement which they have obtained at least significantly differs from random agreement. Their method for showing this is complex and of no concern to us here, since all it tells us is that it is safe to assume that the coders were not coding randomly--reassuring, but no guarantee  Computational Linguistics Volume 22, Number 2 of reliability. It is more important to ask how different the results are from random and whether or not the data produced by coding is too noisy to use for the purpose for which it was collected.</Paragraph>
  </Section>
  <Section position="4" start_page="251" end_page="252" type="metho">
    <SectionTitle>
3. The Kappa Statistic
</SectionTitle>
    <Paragraph position="0"> The concerns of these researchers are largely the same as those in the field of content analysis (see especially Krippendorff \[1980\] and Weber \[1985\]), which has been through the same problems as we are currently facing and in which strong arguments have been made for using the kappa coefficient of agreement (Siegel and Castellan 1988) as a measure of reliability. 1 The kappa coefficient (K) measures pairwise agreement among a set of coders making category judgments, correcting for expected chance agreement:</Paragraph>
    <Paragraph position="2"> where P(A) is the proportion of times that the coders agree and P(E) is the proportion of times that we would expect them to agree by chance, calculated along the lines of the intuitive argument presented above. (For complete instructions on how to calculate K, see Siegel and Castellan \[1988\].) When there is no agreement other than that which would be expected by chance, K is zero. When there is total agreement, K is one. It is possible, and sometimes useful, to test whether or not K is significantly different from chance, but more importantly, interpretation of the scale of agreement is possible.</Paragraph>
    <Paragraph position="3"> Krippendorff (1980) discusses what constitutes an acceptable level of agreement, while giving the caveat that it depends entirely on what one intends to do with the coding. For instance, he claims that finding associations between two variables that both rely on coding schemes with K K .7 is often impossible, and says that content analysis researchers generally think of K &gt; .8 as good reliability, with .67 &lt; K &lt; .8 allowing tentative conclusions to be drawn. We would add two further caveats. First, although kappa addresses many of the problems we have been struggling with as a field, in order to compare K across studies, the underlying assumptions governing the calculation of chance expected agreement still require the units over which coding is performed to be chosen sensibly and comparably. (To see this, compare, for instance, what would happen to the statistic if the same discourse boundary agreement data were calculated variously over a base of clause boundaries, transcribed word boundaries, and transcribed phoneme boundaries.) Where no sensible choice of unit is available pretheoretically, measure (1) may still be preferred. Secondly, coding discourse and dialogue phenomena, and especially coding segment boundaries, may be inherently more difficult than many previous types of content analysis (for instance, 1 There are several variants of the kappa coefficient in the literature, including one, Scott's pi, which actually has been used at least once in our field, to assess agreement on move boundaries in monologues using action assembly theory (Grosz and Sidner 1986). Krippendorff's c~ is more general than Siegel and Castellan's K in that Krippendorff extends the argument from category data to interval and ratio scales; this extension might be useful for, for instance, judging the reliability of TOBI break index coding, since some researchers treat these codes as inherently scalar (Silverman et al. 1992). Krippendorff's c~ and Siegel and Castellan's K differ slightly when used on category judgments in the assumptions under which expected agreement is calculated. Here we use Siegel and Castellan's K because they explain their statistic more clearly, but the value of c~ is so closely related, especially under the usual expectations for reliability studies, that Krippendorff's statements about c~ hold, and we conflate the two under the more general name &amp;quot;kappa.&amp;quot; The advantages and disadvantages of different forms and extensions of kappa have been discussed in many fields but especially in medicine; see, for example, Berry (1992); Goldman (1992); Kraemer (1980); Soeken and Prescott (1986).</Paragraph>
    <Section position="1" start_page="252" end_page="252" type="sub_section">
      <SectionTitle>
Carletta Assessing Agreement
</SectionTitle>
      <Paragraph position="0"> dividing newspaper articles based on subject matter). Whether we have reached (or will be able to reach) a reasonable level of agreement in our work as a field remains to be seen; our point here is merely that if, as a community, we adopt clearer statistics, we will be able to compare results in a standard way across different coding schemes and experiments and to evaluate current developments--and that will illuminate both our individual results and the way forward.</Paragraph>
  </Section>
  <Section position="5" start_page="252" end_page="252" type="metho">
    <SectionTitle>
4. Expert Versus Naive Coders
</SectionTitle>
    <Paragraph position="0"> In assessing the amount of agreement among coders of category distinctions, the kappa statistic normalizes for the amount of expected chance agreement and allows a single measure to be calculated over multiple coders. This makes it applicable to the studies we have described, and more besides. However, we have yet to discuss the role of expert coders in such studies. KID designate one particular coder as the expert.</Paragraph>
    <Paragraph position="1"> Passonneau and Litman have only naive coders, but in essence have an expert opinion available on each unit classified in terms of the majority opinion. Silverman et al. treat all coders indistinguishably, although they do build an interesting argument about how agreement levels shift when a number of less-experienced transcribers are added to a pool of highly experienced ones. We would argue that in subjective codings such as these, there are no real experts. We concur with Krippendorff that what counts is how totally naive coders manage based on written instructions. Comparing naive and expert coding as KID do can be a useful exercise, but rather than assessing the naive coders' accuracy, it in fact measures how well the instructions convey what these researchers think they do. (Krippendorff gives well-established techniques that generalize on this sort of &amp;quot;odd-man-out&amp;quot; result, which involve isolating particular coders, categories, and kinds of units to establish the source of any disagreement.) In Passonneau and Litman, the reason for comparing to the majority opinion is less clean Despite our argument, there are occasions when one opinion should be treated as the expert one. For instance, one can imagine determining whether coders using a simplified coding scheme match what can be obtained by some better but more expensive method, which might itself be either objective or subjective. In these cases, we would argue that it is still appropriate to use the kappa statistic, in a variation which looks only at pairings of agreement with the expert opinion rather than at all possible pairs of coders. This variation could be achieved by interpreting P(A) as the proportion of times that the naive coders agree with the expert and P(E) as the proportion of times we would expect the naive coders to agree with the expert by chance.</Paragraph>
  </Section>
class="xml-element"></Paper>