<?xml version="1.0" standalone="yes"?> <Paper uid="J04-1005"> <Title>(c) 2004 Association for Computational Linguistics Squibs and Discussions</Title> <Section position="2" start_page="96" end_page="99" type="abstr"> <SectionTitle> 1. The Computation of P(E) </SectionTitle> <Paragraph position="0"> P(E) is the probability of agreement among coders due to chance. The literature describes two different methods for estimating a probability distribution for random assignment of categories. In the first, each coder has a personal distribution, based on that coder's distribution of categories (Cohen 1960). In the second, there is one distribution for all coders, derived from the total proportions of categories assigned by all coders (Scott 1955; Fleiss 1971; Krippendorff 1980; Siegel and Castellan 1988).</Paragraph> <Paragraph position="1"> We now illustrate the computation of P(E) according to these two methods. We will then show that the resulting k_Co and k_S&C may straddle one of the significant thresholds used to assess the raw k values.</Paragraph> <Paragraph position="2"> The assumptions underlying these two methods are made tangible in the way the data are visualized: in a contingency table for Cohen, and in what we will call an agreement table for the others. Consider the following situation. Two coders code 150 occurrences of Okay and assign to them one of the two labels Accept or Ack(nowledgement) (Allen and Core 1997). The two coders label 70 occurrences as Accept, and another 55 as Ack. They disagree on 25 occurrences, which one coder labels as Ack, and the other as Accept. In Figure 1, this example is encoded by the top contingency table on the left (labeled Example 1) and the agreement table on the right. The contingency table directly mirrors our description. The agreement table is an N x m matrix, where N is the number of items in the data set and m is the number of labels that can be assigned to each object; in our example, N = 150 and m = 2. Each entry n_ij is the number of codings of label j to item i. The agreement table in Figure 1 shows that occurrences 1 through 70 have been labeled as Accept by both coders, 71 through 125 as Ack by both coders, and 126 to 150 differ in their labels.</Paragraph> <Paragraph position="3"> 1 To be precise, Krippendorff uses a computation very similar to Siegel and Castellan's to produce a statistic called alpha. Krippendorff computes P(E) (called 1 − D_e in his terminology) with a sampling-without-replacement methodology. The computations of P(E) and of 1 − D_e show that the difference is negligible.</Paragraph> <Paragraph position="5"> 2 Both k_S&C and k_Co (Cohen 1960) were originally devised for two coders. Each has been extended to more than two coders, for example, respectively Fleiss (1971) and Bartko and Carpenter (1976). Thus, without loss of generality, our examples involve two coders.</Paragraph> <Paragraph position="7"> Figure 1: Cohen's contingency tables (left) and Siegel and Castellan's agreement table (right). Agreement tables lose information. When the coders disagree, we cannot reconstruct which coder picked which category. Consider Example 2 in Figure 1. The two coders still disagree on 25 occurrences of Okay. However, one coder now labels 10 of those as Accept and the remaining 15 as Ack, whereas the other labels the same 10 as Ack and the same 15 as Accept. The agreement table does not change, but the contingency table does.</Paragraph>
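The two representations just described can be made concrete in code. The following Python sketch is ours, not part of the original figure; only the counts come from Examples 1 and 2, and the list and function names are illustrative.

from collections import Counter

LABELS = ["Accept", "Ack"]

# Per-item labels for the two coders in Example 1: 70 items labeled Accept by both,
# 55 labeled Ack by both, and 25 labeled Ack by coder 1 but Accept by coder 2.
ex1_coder1 = ["Accept"] * 70 + ["Ack"] * 55 + ["Ack"] * 25
ex1_coder2 = ["Accept"] * 70 + ["Ack"] * 55 + ["Accept"] * 25

# Example 2: the same 125 agreements, but the 25 disagreements are now split
# 10/15 in opposite directions between the two coders.
ex2_coder1 = ["Accept"] * 70 + ["Ack"] * 55 + ["Accept"] * 10 + ["Ack"] * 15
ex2_coder2 = ["Accept"] * 70 + ["Ack"] * 55 + ["Ack"] * 10 + ["Accept"] * 15

def contingency_table(c1, c2):
    """Cohen's view: counts indexed by (label of coder 1, label of coder 2)."""
    return Counter(zip(c1, c2))

def agreement_table(c1, c2):
    """Siegel and Castellan's view: an N x m matrix whose entry n_ij is the
    number of codings of label j for item i (each row sums to k = 2 coders)."""
    return [[int(a == lab) + int(b == lab) for lab in LABELS]
            for a, b in zip(c1, c2)]

print(contingency_table(ex1_coder1, ex1_coder2))  # differs from Example 2
print(contingency_table(ex2_coder1, ex2_coder2))
print(agreement_table(ex1_coder1, ex1_coder2) ==
      agreement_table(ex2_coder1, ex2_coder2))    # True: identical agreement tables

The final check prints True: once flattened into agreement tables, Example 1 and Example 2 are indistinguishable, even though their contingency tables differ.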
<Paragraph position="8"> Turning now to computing P(E), Figure 2 shows, for Example 1, Cohen's computation of P(E) on the left and Siegel and Castellan's computation on the right. We include the computations of k_Co and k_S&C as the last step. For both Cohen and Siegel and Castellan, P(A) = 125/150 = 0.8333. The observed agreement P(A) is computed as the proportion of items the coders agree on to the total number of items; N is the number of items, and k the number of coders (N = 150 and k = 2 in our example). P(E) is then computed according to the formulas in Cohen (1960) and Siegel and Castellan (1988), respectively.</Paragraph> <Paragraph position="9"> The difference between k_Co and k_S&C in Figure 2 is just under 1%; however, the results of the two k computations straddle the value 0.67, which for better or worse has been adopted as a cutoff in computational linguistics. This cutoff is based on the assessment of k values in Krippendorff (1980), which discounts k < 0.67 and allows tentative conclusions when 0.67 ≤ k < 0.8 and definite conclusions when k ≥ 0.8. Krippendorff's scale has been adopted without question, even though Krippendorff himself considers it only a plausible standard that has emerged from his and his colleagues' work. In fact, Carletta et al. (1997) use words of caution against adopting Krippendorff's suggestion as a standard; the first author has also raised the issue of how to assess k values in Di Eugenio (2000).</Paragraph> <Paragraph position="10"> If Krippendorff's scale is supposed to be our standard, the example just worked out shows that the different computations of P(E) do affect the assessment of inter-coder agreement. If less-strict scales are adopted, the discrepancies between the two k computations play a larger role, as they have a larger effect on smaller values of k. For example, Rietveld and van Hout (1993) consider 0.20 < k ≤ 0.40 as indicating fair agreement, and 0.40 < k ≤ 0.60 as indicating moderate agreement. Suppose that two coders are coding 100 occurrences of Okay. The two coders label 40 occurrences as Accept and 25 as Ack. The remaining 35 are labeled as Ack by one coder and as Accept by the other (as in Example 6 in Figure 4); k_Co = 0.418, whereas k_S&C = 0.27. These two values are really at odds.</Paragraph> <Paragraph position="12"> Step 1 (Cohen). For each category j, compute the overall proportion p_{j,l} of items assigned to j by each coder l. In a contingency table, each row and column total divided by N corresponds to one such proportion for the corresponding coder. Step 1 (Siegel and Castellan). For each category j, compute p_j, the overall proportion of items assigned to j. In an agreement table, the column totals give the total counts for each category j, hence p_j = (Σ_i n_ij) / (kN).</Paragraph> <Paragraph position="14"> Step 2. For a given item, the likelihood of both coders' independently agreeing on category j by chance is p_{j,1} x p_{j,2} for Cohen and p_j^2 for Siegel and Castellan.</Paragraph> <Paragraph position="16"> Step 3. P(E), the likelihood of coders' accidentally assigning the same category to a given item, is P(E)_Co = Σ_j p_{j,1} p_{j,2} for Cohen and P(E)_S&C = Σ_j p_j^2 for Siegel and Castellan; in both cases k = (P(A) − P(E)) / (1 − P(E)).</Paragraph> <Paragraph position="18"> Figure 2: The computation of P(E) and k according to Cohen (left) and to Siegel and Castellan (right).</Paragraph>
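The following Python sketch is ours, not the paper's; it implements Steps 1 through 3 above, plus the final k computation, for a two-coder contingency table and reproduces the straddling of the 0.67 cutoff on Example 1.

# table[(l1, l2)] is the number of items labeled l1 by coder 1 and l2 by coder 2.

def kappa_cohen(table):
    """k with a separate chance distribution per coder (Cohen 1960)."""
    n = sum(table.values())
    labels = {l for pair in table for l in pair}
    p_a = sum(table.get((l, l), 0) for l in labels) / n   # observed agreement P(A)
    # Step 1 (Cohen): each coder's own proportion for every category.
    p1 = {l: sum(v for (a, _), v in table.items() if a == l) / n for l in labels}
    p2 = {l: sum(v for (_, b), v in table.items() if b == l) / n for l in labels}
    # Steps 2-3: chance agreement P(E) = sum_j p_{j,1} * p_{j,2}.
    p_e = sum(p1[l] * p2[l] for l in labels)
    return (p_a - p_e) / (1 - p_e)

def kappa_sc(table):
    """k with one pooled chance distribution (Scott 1955; Siegel and Castellan 1988)."""
    n = sum(table.values())
    labels = {l for pair in table for l in pair}
    p_a = sum(table.get((l, l), 0) for l in labels) / n
    # Step 1 (S&C): pooled proportion p_j over all 2N judgments.
    p = {l: sum(v for pair, v in table.items() for x in pair if x == l) / (2 * n)
         for l in labels}
    # Steps 2-3: P(E) = sum_j p_j ** 2.
    p_e = sum(p[l] ** 2 for l in labels)
    return (p_a - p_e) / (1 - p_e)

# Example 1 from Figure 1: 70 joint Accepts, 55 joint Acks, 25 one-way disagreements.
example1 = {("Accept", "Accept"): 70, ("Ack", "Ack"): 55, ("Ack", "Accept"): 25}
print(kappa_cohen(example1))  # ~0.672, just above the 0.67 cutoff
print(kappa_sc(example1))     # ~0.663, just below it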
<Paragraph position="19"> 2. Unpleasant Behaviors of Kappa: Prevalence and Bias In the computational linguistics literature, k has been used mostly to validate coding schemes: namely, a &quot;good&quot; value of k means that the coders agree on the categories and therefore that those categories are &quot;real.&quot; We noted previously that assessing what constitutes a &quot;good&quot; value for k is problematic in itself and that different scales have been proposed. The problem is compounded by the following obvious effect on k values: if P(A) is kept constant, varying values for P(E) yield varying values of k. What can affect P(E) even if P(A) is constant are prevalence and bias.</Paragraph> <Paragraph position="20"> The prevalence problem arises because skewing the distribution of categories in the data increases P(E). The minimum value P(E) = 1/m occurs when the labels are equally distributed among the m categories (see Example 4 in Figure 3). The maximum value P(E) = 1 occurs when the labels are all concentrated in a single category. But for a given value of P(A), the larger the value of P(E), the lower the value of k.</Paragraph> <Paragraph position="21"> Example 3 and Example 4 in Figure 3 show two coders agreeing on 90 out of 100 occurrences of Okay, that is, P(A) = 0.9. However, k ranges from −0.048 to 0.80, and from not significant to significant (the values of k_S&C for Examples 3 and 4 are the same as the values of k_Co). The differences in k are due to the difference in the relative prevalence of the two categories Accept and Ack. In Example 3, the distribution is skewed, as there are 190 Accepts but only 10 Acks across the two coders; in Example 4, the distribution is even, as there are 100 Accepts and 100 Acks, respectively. These results do not depend on the size of the sample; that is, they are not due to the fact that Example 3 and Example 4 are small. As the computations of P(A) and P(E) are based on proportions, the same distributions of categories in a much larger sample, say, 10,000 items, will result in exactly the same k values. Although this behavior follows squarely from k's definition, it is at odds with using k to assess a coding scheme. From both Example 3 and Example 4 we would like to conclude that the two coders are in substantial agreement, independent of the skewed prevalence of Accept with respect to Ack in Example 3. The role of prevalence in assessing k has been subject to heated discussion in the medical literature (Grove et al. 1981; Berry 1992; Goldman 1992).</Paragraph> <Paragraph position="23"> Figure 4: Contingency tables illustrating the bias effect on k_Co.</Paragraph> <Paragraph position="25"> The bias problem occurs in k_Co because Cohen's P(E) is computed from each coder's individual probabilities. Thus, the less two coders agree in their overall behavior, the fewer chance agreements are expected. But for a given value of P(A), decreasing P(E) will increase k_Co, leading to the paradox that k_Co increases as the coders become less similar, that is, as the marginal totals diverge in the contingency table. Consider two coders coding the usual 100 occurrences of Okay, according to the two tables in Figure 4. In Example 5, the proportions of each category are very similar among coders, at 55 versus 60 Accept, and 45 versus 40 Ack. However, in Example 6 coder 1 favors Accept much more than coder 2 (75 versus 40 occurrences) and conversely chooses Ack much less frequently (25 versus 60 occurrences). In both cases, P(A) is 0.65 and k_S&C is stable at 0.27, but k_Co goes from 0.27 to 0.418. Our initial example in Figure 1 is also affected by bias, since in Example 1 the two coders' marginal totals differ as well. Moreover, the assumption of a single distribution for all coders does not seem appropriate for coding in discourse and dialogue work. In fact, it appears to us that it holds in few if any of the published discourse- or dialogue-tagging efforts for which k has been computed. It is, for example, appropriate in situations in which item i may be tagged by different coders than item j (Fleiss 1971). However, k assessments for discourse and dialogue tagging are most often performed on the same portion of the data, which has been annotated by each of a small number of annotators (between two and four). In fact, in many cases the analysis of systematic disagreements among annotators on the same portion of the data (i.e., of bias) can be used to improve the coding scheme (Wiebe, Bruce, and O'Hara 1999).</Paragraph>
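To make the prevalence and bias discussion concrete, the Python sketch below (again ours) condenses the two k computations into a single helper. The exact cell counts of Examples 3 and 4 are not given in this excerpt, so the two tables are reconstructions that only respect the stated marginal totals (190 versus 10 and 100 versus 100 judgments) and P(A) = 0.9.

def kappas(table):
    """Return (k_Co, k_S&C) for a two-coder contingency table
    {(label1, label2): count}; condensed from the earlier sketch."""
    n = sum(table.values())
    labels = {l for pair in table for l in pair}
    p_a = sum(table.get((l, l), 0) for l in labels) / n
    p1 = {l: sum(v for (a, _), v in table.items() if a == l) / n for l in labels}
    p2 = {l: sum(v for (_, b), v in table.items() if b == l) / n for l in labels}
    pe_co = sum(p1[l] * p2[l] for l in labels)               # Cohen's P(E)
    pe_sc = sum(((p1[l] + p2[l]) / 2) ** 2 for l in labels)  # Siegel & Castellan's P(E)
    return (p_a - pe_co) / (1 - pe_co), (p_a - pe_sc) / (1 - pe_sc)

# Skewed reconstruction of Example 3: 90 joint Accepts, 5 disagreements each way.
ex3 = {("Accept", "Accept"): 90, ("Accept", "Ack"): 5, ("Ack", "Accept"): 5}
# Even reconstruction of Example 4: 45 joint Accepts, 45 joint Acks, 5 each way.
ex4 = {("Accept", "Accept"): 45, ("Ack", "Ack"): 45,
       ("Accept", "Ack"): 5, ("Ack", "Accept"): 5}

print(kappas(ex3))  # both values near zero (about -0.05) despite P(A) = 0.9
print(kappas(ex4))  # both values at 0.80 for the same P(A) = 0.9

Despite identical observed agreement, the skewed table yields k near zero while the even table yields k = 0.80, which is the prevalence effect described above.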
<Paragraph position="26"> To use k_Co but to guard against bias, Cicchetti and Feinstein (1990) suggest that k_Co be supplemented, for each coding category, by two measures of agreement, positive and negative, between the coders. This means a total of 2m additional measures, which we believe are too many to gain a general insight into the meaning of the specific k_Co value. Alternatively, Byrt, Bishop, and Carlin (1993) suggest that intercoder reliability be reported as three numbers: k_Co and two adjustments of k_Co, one correcting for bias, which coincides with k_S&C, and one correcting for both bias and prevalence, which reduces to 2P(A) − 1. For Examples 5 and 6, k_S&C = 0.27, and 2P(A) − 1 = 0.3.</Paragraph> <Paragraph position="27"> For both Examples 3 and 4, 2P(A) − 1 = 0.8. Collectively, these three numbers appear to provide a means of better judging the meaning of k values. Reporting both k and 2P(A) − 1 may seem contradictory, as 2P(A) − 1 does not correct for expected agreement. However, when the distribution of categories is skewed, this highlights the effect of prevalence. Reporting both k_Co and k_S&C does not invalidate our previous discussion, as we believe k_Co is more appropriate for discourse- and dialogue-tagging in the majority of cases, especially when exploiting bias to improve coding (Wiebe, Bruce, and O'Hara 1999).</Paragraph> </Section> </Paper>
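The three-number report discussed above can also be sketched in Python; the helper below is ours, and it pairs k_Co with k_S&C as the bias adjustment and with 2P(A) − 1 as the prevalence- and bias-adjusted value, following the identification made in the preceding paragraphs.

def reliability_report(table):
    """Return (k_Co, k_S&C, 2P(A) - 1) for a two-coder contingency table."""
    n = sum(table.values())
    labels = {l for pair in table for l in pair}
    p_a = sum(table.get((l, l), 0) for l in labels) / n
    p1 = {l: sum(v for (a, _), v in table.items() if a == l) / n for l in labels}
    p2 = {l: sum(v for (_, b), v in table.items() if b == l) / n for l in labels}
    pe_co = sum(p1[l] * p2[l] for l in labels)
    pe_sc = sum(((p1[l] + p2[l]) / 2) ** 2 for l in labels)
    return ((p_a - pe_co) / (1 - pe_co),
            (p_a - pe_sc) / (1 - pe_sc),
            2 * p_a - 1)

# Example 1 from Figure 1: the two kappas correct for chance, while 2P(A) - 1
# reflects raw agreement only (2 * 0.8333 - 1 = 0.667).
example1 = {("Accept", "Accept"): 70, ("Ack", "Ack"): 55, ("Ack", "Accept"): 25}
print(reliability_report(example1))

Applied to the Example 3 and Example 4 reconstructions in the previous sketch, the third number is 0.8 in both cases, in line with the observation above about prevalence.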